{"id":"astro-ph/0701361v1","submitted":"2007-01-12 03:28:11","updated":"2007-01-12 03:28:11","title":"How to Make the Dream Come True: The Astronomers' Data Manifesto","abstract":" Astronomy is one of the most data-intensive of the sciences. Data technology\nis accelerating the quality and effectiveness of its research, and the rate of\nastronomical discovery is higher than ever. As a result, many view astronomy as\nbeing in a 'Golden Age', and projects such as the Virtual Observatory are\namongst the most ambitious data projects in any field of science. But these\npowerful tools will be impotent unless the data on which they operate are of\nmatching quality. Astronomy, like other fields of science, therefore needs to\nestablish and agree on a set of guiding principles for the management of\nastronomical data. To focus this process, we are constructing a 'data\nmanifesto', which proposes guidelines to maximise the rate and\ncost-effectiveness of scientific discovery.\n","authors":"Ray P Norris","affiliations":"","link_abstract":"http://arxiv.org/abs/astro-ph/0701361v1","link_pdf":"http://arxiv.org/pdf/astro-ph/0701361v1","link_doi":"","comment":"Submitted to Data Science Journal Presented at CODATA, Beijing,\n October 2006","journal_ref":"","doi":"","primary_category":"astro-ph","categories":"astro-ph"} {"id":"0901.2805v1","submitted":"2009-01-19 10:38:33","updated":"2009-01-19 10:38:33","title":"Safeguarding Old and New Journal Tables for the VO: Status for\n Extragalactic and Radio Data","abstract":" Independent of established data centers, and partly for my own research,\nsince 1989 I have been collecting the tabular data from over 2600 articles\nconcerned with radio sources and extragalactic objects in general. Optical\ncharacter recognition (OCR) was used to recover tables from 740 papers. Tables\nfrom only 41 percent of the 2600 articles are available in the CDS or CATS\ncatalog collections, and only slightly better coverage is estimated for the NED\ndatabase. This fraction is not better for articles published electronically\nsince 2001. Both object databases (NED, SIMBAD, LEDA) as well as catalog\nbrowsers (VizieR, CATS) need to be consulted to obtain the most complete\ninformation on astronomical objects. More human resources at the data centers\nand better collaboration between authors, referees, editors, publishers, and\ndata centers are required to improve data coverage and accessibility. The\ncurrent efforts within the Virtual Observatory (VO) project, to provide\nretrieval and analysis tools for different types of published and archival data\nstored at various sites, should be balanced by an equal effort to recover and\ninclude large amounts of published data not currently available in this way.\n","authors":"Heinz Andernach","affiliations":"","link_abstract":"http://arxiv.org/abs/0901.2805v1","link_pdf":"http://arxiv.org/pdf/0901.2805v1","link_doi":"http://dx.doi.org/10.2481/dsj.8.41","comment":"11 pages, 4 figures; accepted for publication in Data Science\n Journal, vol. 
8 (2009), http://dsj.codataweb.org; presented at Special\n Session \"Astronomical Data and the Virtual Observatory\" on the conference\n \"CODATA 21\", Kiev, Ukraine, October 5-8, 2008","journal_ref":"","doi":"10.2481/dsj.8.41","primary_category":"astro-ph.IM","categories":"astro-ph.IM|astro-ph.CO"} {"id":"0901.3118v2","submitted":"2009-01-20 18:48:59","updated":"2009-01-24 19:23:47","title":"The CATS Service: an Astrophysical Research Tool","abstract":" We describe the current status of CATS (astrophysical CATalogs Support\nsystem), a publicly accessible tool maintained at Special Astrophysical\nObservatory of the Russian Academy of Sciences (SAO RAS) (http://cats.sao.ru)\nallowing one to search hundreds of catalogs of astronomical objects discovered\nall along the electromagnetic spectrum. Our emphasis is mainly on catalogs of\nradio continuum sources observed from 10 MHz to 245 GHz, and secondly on\ncatalogs of objects such as radio and active stars, X-ray binaries, planetary\nnebulae, HII regions, supernova remnants, pulsars, nearby and radio galaxies,\nAGN and quasars. CATS also includes the catalogs from the largest extragalactic\nsurveys with non-radio waves. In 2008 CATS comprised a total of about 10e9\nrecords from over 400 catalogs in the radio, IR, optical and X-ray windows,\nincluding most source catalogs deriving from observations with the Russian\nradio telescope RATAN-600. CATS offers several search tools through different\nways of access, e.g. via web interface and e-mail. Since its creation in 1997\nCATS has managed about 10,000 requests. Currently CATS is used by external\nusers about 1500 times per day and since its opening to the public in 1997 has\nreceived about 4000 requests for its selection and matching tasks.\n","authors":"O. V. Verkhodanov|S. A. Trushkin|H. Andernach|V. N. Chernenkov","affiliations":"Special Astrophysical Observatory, Nizhnij Arkhyz, Karachaj-Cherkesia, Russia;|Special Astrophysical Observatory, Nizhnij Arkhyz, Karachaj-Cherkesia, Russia;|Argelander-Institut fuer Astronomie, Universitaet Bonn, Bonn, Germany; on leave of absence from Depto. de Astronomia, Univ. Guanajuato, Mexico|Special Astrophysical Observatory, Nizhnij Arkhyz, Karachaj-Cherkesia, Russia;","link_abstract":"http://arxiv.org/abs/0901.3118v2","link_pdf":"http://arxiv.org/pdf/0901.3118v2","link_doi":"http://dx.doi.org/10.2481/dsj.8.34","comment":"8 pages, no figures; accepted for publication in Data Science\n Journal, vol. 8 (2009), http://dsj.codataweb.org; presented at Special\n Session \"Astronomical Data and the Virtual Observatory\" on the conference\n \"CODATA 21\", Kiev, Ukraine, October 5-8, 2008; replaced incorrect reference\n arXiv:0901.2085 with arXiv:0901.2805","journal_ref":"","doi":"10.2481/dsj.8.34","primary_category":"astro-ph.IM","categories":"astro-ph.IM|astro-ph.CO"} {"id":"0909.3895v1","submitted":"2009-09-22 02:55:14","updated":"2009-09-22 02:55:14","title":"The Revolution in Astronomy Education: Data Science for the Masses","abstract":" As our capacity to study ever-expanding domains of our science has increased\n(including the time domain, non-electromagnetic phenomena, magnetized plasmas,\nand numerous sky surveys in multiple wavebands with broad spatial coverage and\nunprecedented depths), so have the horizons of our understanding of the\nUniverse been similarly expanding. 
This expansion is coupled to the exponential\ndata deluge from multiple sky surveys, which have grown from gigabytes into\nterabytes during the past decade, and will grow from terabytes into Petabytes\n(even hundreds of Petabytes) in the next decade. With this increased vastness\nof information, there is a growing gap between our awareness of that\ninformation and our understanding of it. Training the next generation in the\nfine art of deriving intelligent understanding from data is needed for the\nsuccess of sciences, communities, projects, agencies, businesses, and\neconomies. This is true for both specialists (scientists) and non-specialists\n(everyone else: the public, educators and students, workforce). Specialists\nmust learn and apply new data science research techniques in order to advance\nour understanding of the Universe. Non-specialists require information literacy\nskills as productive members of the 21st century workforce, integrating\nfoundational skills for lifelong learning in a world increasingly dominated by\ndata. We address the impact of the emerging discipline of data science on\nastronomy education within two contexts: formal education and lifelong\nlearners.\n","authors":"Kirk D. Borne|Suzanne Jacoby|K. Carney|A. Connolly|T. Eastman|M. J. Raddick|J. A. Tyson|J. Wallin","affiliations":"George Mason University|LSST Corporation|Adler Planetarium|U. Washington|Wyle Information Systems|JHU/SDSS|UC Davis|George Mason University","link_abstract":"http://arxiv.org/abs/0909.3895v1","link_pdf":"http://arxiv.org/pdf/0909.3895v1","link_doi":"","comment":"12 pages total: 1 cover page, 1 page of co-signers, plus 10 pages,\n State of the Profession Position Paper submitted to the Astro2010 Decadal\n Survey (March 2009)","journal_ref":"","doi":"","primary_category":"astro-ph.IM","categories":"astro-ph.IM|cs.DB|cs.DL|cs.IR|physics.ed-ph"} {"id":"1106.2503v5","submitted":"2011-06-13 17:42:32","updated":"2013-06-23 21:21:41","title":"A Large-Scale Community Structure Analysis In Facebook","abstract":" Understanding social dynamics that govern human phenomena, such as\ncommunications and social relationships is a major problem in current\ncomputational social sciences. In particular, given the unprecedented success\nof online social networks (OSNs), in this paper we are concerned with the\nanalysis of aggregation patterns and social dynamics occurring among users of\nthe largest OSN as the date: Facebook. In detail, we discuss the mesoscopic\nfeatures of the community structure of this network, considering the\nperspective of the communities, which has not yet been studied on such a large\nscale. To this purpose, we acquired a sample of this network containing\nmillions of users and their social relationships; then, we unveiled the\ncommunities representing the aggregation units among which users gather and\ninteract; finally, we analyzed the statistical features of such a network of\ncommunities, discovering and characterizing some specific organization patterns\nfollowed by individuals interacting in online social networks, that emerge\nconsidering different sampling techniques and clustering methodologies. 
This\nstudy provides some clues of the tendency of individuals to establish social\ninteractions in online social networks that eventually contribute to building a\nwell-connected social structure, and opens space for further social studies.\n","authors":"Emilio Ferrara","affiliations":"","link_abstract":"http://arxiv.org/abs/1106.2503v5","link_pdf":"http://arxiv.org/pdf/1106.2503v5","link_doi":"http://dx.doi.org/10.1140/epjds9","comment":"30 pages, 13 Figures - Published on: EPJ Data Science, 1:9, 2012 -\n open access at: http://www.epjdatascience.com/content/1/1/9","journal_ref":"EPJ Data Science, 1:9, 2012","doi":"10.1140/epjds9","primary_category":"cs.SI","categories":"cs.SI|cs.CY|physics.soc-ph|91D30, 05C82, 68R10, 90B10, 90C35|H.2.8; D.2.8"} {"id":"1106.3305v1","submitted":"2011-06-16 18:45:32","updated":"2011-06-16 18:45:32","title":"The Art of Data Science","abstract":" To flourish in the new data-intensive environment of 21st century science, we\nneed to evolve new skills. These can be expressed in terms of the systemized\nframework that formed the basis of mediaeval education - the trivium (logic,\ngrammar, and rhetoric) and quadrivium (arithmetic, geometry, music, and\nastronomy). However, rather than focusing on number, data is the new keystone.\nWe need to understand what rules it obeys, how it is symbolized and\ncommunicated and what its relationship to physical space and time is. In this\npaper, we will review this understanding in terms of the technologies and\nprocesses that it requires. We contend that, at least, an appreciation of all\nthese aspects is crucial to enable us to extract scientific information and\nknowledge from the data sets which threaten to engulf and overwhelm us.\n","authors":"Matthew J. Graham","affiliations":"","link_abstract":"http://arxiv.org/abs/1106.3305v1","link_pdf":"http://arxiv.org/pdf/1106.3305v1","link_doi":"http://dx.doi.org/10.1007/978-1-4614-3323-1_4","comment":"12 pages, invited talk at Astrostatistics and Data Mining in Large\n Astronomical Databases workshop, La Palma, Spain, 30 May - 3 June 2011, to\n appear in Springer Series on Astrostatistics","journal_ref":"","doi":"10.1007/978-1-4614-3323-1_4","primary_category":"astro-ph.IM","categories":"astro-ph.IM|cs.DL"} {"id":"1110.4123v4","submitted":"2011-10-18 20:54:21","updated":"2012-05-27 14:55:40","title":"Positive words carry less information than negative words","abstract":" We show that the frequency of word use is not only determined by the word\nlength \\cite{Zipf1935} and the average information content\n\\cite{Piantadosi2011}, but also by its emotional content. We have analyzed\nthree established lexica of affective word usage in English, German, and\nSpanish, to verify that these lexica have a neutral, unbiased, emotional\ncontent. Taking into account the frequency of word usage, we find that words\nwith a positive emotional content are more frequently used. This lends support\nto Pollyanna hypothesis \\cite{Boucher1969} that there should be a positive bias\nin human expression. We also find that negative words contain more information\nthan positive words, as the informativeness of a word increases uniformly with\nits valence decrease. 
Our findings support earlier conjectures about (i) the\nrelation between word frequency and information content, and (ii) the impact of\npositive emotions on communication and social links.\n","authors":"David Garcia|Antonios Garas|Frank Schweitzer","affiliations":"","link_abstract":"http://arxiv.org/abs/1110.4123v4","link_pdf":"http://arxiv.org/pdf/1110.4123v4","link_doi":"http://dx.doi.org/10.1140/epjds3","comment":"16 pages, 3 figures, 3 tables","journal_ref":"EPJ Data Science 2012, 1:3","doi":"10.1140/epjds3","primary_category":"cs.CL","categories":"cs.CL|cs.IR|physics.soc-ph"} {"id":"1202.1145v1","submitted":"2012-02-06 14:16:14","updated":"2012-02-06 14:16:14","title":"Effects of time window size and placement on the structure of aggregated\n networks","abstract":" Complex networks are often constructed by aggregating empirical data over\ntime, such that a link represents the existence of interactions between the\nendpoint nodes and the link weight represents the intensity of such\ninteractions within the aggregation time window. The resulting networks are\nthen often considered static. More often than not, the aggregation time window\nis dictated by the availability of data, and the effects of its length on the\nresulting networks are rarely considered. Here, we address this question by\nstudying the structural features of networks emerging from aggregating\nempirical data over different time intervals, focussing on networks derived\nfrom time-stamped, anonymized mobile telephone call records. Our results show\nthat short aggregation intervals yield networks where strong links associated\nwith dense clusters dominate; the seeds of such clusters or communities become\nalready visible for intervals of around one week. The degree and weight\ndistributions are seen to become stationary around a few days and a few weeks,\nrespectively. An aggregation interval of around 30 days results in the stablest\nsimilar networks when consecutive windows are compared. For longer intervals,\nthe effects of weak or random links become increasingly stronger, and the\naverage degree of the network keeps growing even for intervals up to 180 days.\nThe placement of the time window is also seen to affect the outcome: for short\nwindows, different behavioural patterns play a role during weekends and\nweekdays, and for longer windows it is seen that networks aggregated during\nholiday periods are significantly different.\n","authors":"Gautier Krings|Márton Karsai|Sebastian Bernharsson|Vincent D Blondel|Jari Saramäki","affiliations":"","link_abstract":"http://arxiv.org/abs/1202.1145v1","link_pdf":"http://arxiv.org/pdf/1202.1145v1","link_doi":"http://dx.doi.org/10.1140/epjds4","comment":"19 pages, 11 figures","journal_ref":"EPJ Data Science, 2012, Volume 1, Number 1, 4","doi":"10.1140/epjds4","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.SI"} {"id":"1202.5840v1","submitted":"2012-02-27 07:37:09","updated":"2012-02-27 07:37:09","title":"On the Influence of the Data Sampling Interval on Computer-Derived\n K-Indices","abstract":" The K index was devised by Bartels et al. (1939) to provide an objective\nmonitoring of irregular geomagnetic activity. The K index was then routinely\nused to monitor the magnetic activity at permanent magnetic observatories as\nwell as at temporary stations. The increasing number of digital and sometimes\nunmanned observatories and the creation of INTERMAGNET put the question of\ncomputer production of K at the centre of the debate. 
Four algorithms were\nselected during the Vienna meeting (1991) and endorsed by IAGA for the computer\nproduction of K indices. We used one of them (FMI algorithm) to investigate the\nimpact of the geomagnetic data sampling interval on computer produced K values\nthrough the comparison of the computer derived K values for the period 2009,\nJanuary 1st to 2010, May 31st at the Port-aux-Francais magnetic observatory\nusing magnetic data series with different sampling rates (the smaller: 1\nsecond; the larger: 1 minute). The impact is investigated on both 3-hour range\nvalues and K indices data series, as a function of the activity level for low\nand moderate geomagnetic activity.\n","authors":"Armelle Bernard|Menvielle Michel|Aude Chambodut","affiliations":"EOSTS|LATMOS|EOSTS, IPGS","link_abstract":"http://arxiv.org/abs/1202.5840v1","link_pdf":"http://arxiv.org/pdf/1202.5840v1","link_doi":"http://dx.doi.org/10.2481/dsj.IAGA-07","comment":"","journal_ref":"Data Science Journal 10 (2011) 41-46","doi":"10.2481/dsj.IAGA-07","primary_category":"physics.geo-ph","categories":"physics.geo-ph"} {"id":"1204.2169v3","submitted":"2012-04-10 14:42:56","updated":"2012-09-26 12:16:40","title":"Spatiotemporal correlations of handset-based service usages","abstract":" We study spatiotemporal correlations and temporal diversities of\nhandset-based service usages by analyzing a dataset that includes detailed\ninformation about locations and service usages of 124 users over 16 months. By\nconstructing the spatiotemporal trajectories of the users we detect several\nmeaningful places or contexts for each one of them and show how the context\naffects the service usage patterns. We find that temporal patterns of service\nusages are bound to the typical weekly cycles of humans, yet they show maximal\nactivities at different times. We first discuss their temporal correlations and\nthen investigate the time-ordering behavior of communication services like\ncalls being followed by the non-communication services like applications. We\nalso find that the behavioral overlap network based on the clustering of\ntemporal patterns is comparable to the communication network of users. Our\napproach provides a useful framework for handset-based data analysis and helps\nus to understand the complexities of information and communications technology\nenabled human behavior.\n","authors":"Hang-Hyun Jo|Márton Karsai|Juuso Karikoski|Kimmo Kaski","affiliations":"","link_abstract":"http://arxiv.org/abs/1204.2169v3","link_pdf":"http://arxiv.org/pdf/1204.2169v3","link_doi":"http://dx.doi.org/10.1140/epjds10","comment":"11 pages, 15 figures","journal_ref":"EPJ Data Science 1, 10 (2012)","doi":"10.1140/epjds10","primary_category":"physics.soc-ph","categories":"physics.soc-ph|physics.data-an"} {"id":"1205.1010v2","submitted":"2012-05-04 17:10:43","updated":"2012-06-19 16:23:42","title":"Partisan Asymmetries in Online Political Activity","abstract":" We examine partisan differences in the behavior, communication patterns and\nsocial interactions of more than 18,000 politically-active Twitter users to\nproduce evidence that points to changing levels of partisan engagement with the\nAmerican online political landscape. Analysis of a network defined by the\ncommunication activity of these users in proximity to the 2010 midterm\ncongressional elections reveals a highly segregated, well clustered partisan\ncommunity structure. 
Using cluster membership as a high-fidelity (87% accuracy)\nproxy for political affiliation, we characterize a wide range of differences in\nthe behavior, communication and social connectivity of left- and right-leaning\nTwitter users. We find that in contrast to the online political dynamics of the\n2008 campaign, right-leaning Twitter users exhibit greater levels of political\nactivity, a more tightly interconnected social structure, and a communication\nnetwork topology that facilitates the rapid and broad dissemination of\npolitical information.\n","authors":"Michael D. Conover|Bruno Gonçalves|Alessandro Flammini|Filippo Menczer","affiliations":"","link_abstract":"http://arxiv.org/abs/1205.1010v2","link_pdf":"http://arxiv.org/pdf/1205.1010v2","link_doi":"http://dx.doi.org/10.1140/epjds6","comment":"17 pages, 10 figures, 6 tables","journal_ref":"EPJ Data Science 1, 6 (2012)","doi":"10.1140/epjds6","primary_category":"cs.SI","categories":"cs.SI|cs.HC|physics.soc-ph"} {"id":"1206.5453v3","submitted":"2012-06-24 00:57:27","updated":"2012-10-03 14:41:09","title":"Genetic flow directionality and geographical segregation in a Cymodocea\n nodosa genetic diversity network","abstract":" We analyse a large data set of genetic markers obtained from populations of\nCymodocea nodosa, a marine plant occurring from the East Mediterranean to the\nIberian-African coasts in the Atlantic Ocean. We fully develop and test a\nrecently introduced methodology to infer the directionality of gene flow based\non the concept of geographical segregation. Using the Jensen-Shannon\ndivergence, we are able to extract a directed network of gene flow describing\nthe evolutionary patterns of Cymodocea nodosa. In particular we recover the\ngenetic segregation that the marine plant underwent during its evolution. The\nresults are confirmed by natural evidence and are consistent with an\nindependent cross analysis.\n","authors":"Paolo Masucci|Sophie Arnaud-Haond|Víctor M. Eguíluz|Emilio Hernández-García|Ester A. Serrão","affiliations":"","link_abstract":"http://arxiv.org/abs/1206.5453v3","link_pdf":"http://arxiv.org/pdf/1206.5453v3","link_doi":"http://dx.doi.org/10.1140/epjds11","comment":"","journal_ref":"EPJ Data Science, 1:11, 2012","doi":"10.1140/epjds11","primary_category":"q-bio.PE","categories":"q-bio.PE|physics.bio-ph|physics.data-an"} {"id":"1208.1517v2","submitted":"2012-08-07 20:26:40","updated":"2013-01-26 16:13:50","title":"On the spatial correlation between areas of high coseismic slip and\n aftershock clusters of the Maule earthquake Mw=8.8","abstract":" We study the spatial distribution of clusters associated to the aftershocks\nof the megathrust Maule earthquake MW 8.8 of 27 February 2010. We used a recent\nclustering method which hinges on a nonparametric estimation of the underlying\nprobability density function to detect subsets of points forming clusters\nassociated with high density areas. In addition, we estimate the probability\ndensity function using a nonparametric kernel method for each of these\nclusters. This allows us to identify a set of regions where there is an\nassociation between frequency of events and coseismic slip. Our results suggest\nthat high coseismic slip spatially correlates with high aftershock frequency.\n","authors":"Javier E. 
Contreras-Reyes|Adelchi Azzalini","affiliations":"","link_abstract":"http://arxiv.org/abs/1208.1517v2","link_pdf":"http://arxiv.org/pdf/1208.1517v2","link_doi":"http://dx.doi.org/10.6339/JDS.2013.11(4).1188","comment":"16 pages, 5 figures","journal_ref":"Journal of Data Science (2013), 11(4), 623-638","doi":"10.6339/JDS.2013.11(4).1188","primary_category":"stat.AP","categories":"stat.AP|physics.geo-ph|stat.ME"} {"id":"1210.6636v1","submitted":"2012-10-24 19:24:59","updated":"2012-10-24 19:24:59","title":"Informaticology: combining Computer Science, Data Science, and Fiction\n Science","abstract":" Motivated by an intention to remedy current complications with Dutch\nterminology concerning informatics, the term informaticology is positioned to\ndenote an academic counterpart of informatics where informatics is conceived of\nas a container for a coherent family of practical disciplines ranging from\ncomputer engineering and software engineering to network technology, data\ncenter management, information technology, and information management in a\nbroad sense.\n Informaticology escapes from the limitations of instrumental objectives and\nthe perspective of usage that both restrict the scope of informatics. That is\nachieved by including fiction science in informaticology and by ranking fiction\nscience on equal terms with computer science and data science, and framing (the\nstudy of) game design, evelopment, assessment and distribution, ranging from\nserious gaming to entertainment gaming, as a chapter of fiction science. A\nsuggestion for the scope of fiction science is specified in some detail.\n In order to illustrate the coherence of informaticology thus conceived, a\npotential application of fiction to the ontology of instruction sequences and\nto software quality assessment is sketched, thereby highlighting a possible\nrole of fiction (science) within informaticology but outside gaming.\n","authors":"Jan A. Bergstra","affiliations":"","link_abstract":"http://arxiv.org/abs/1210.6636v1","link_pdf":"http://arxiv.org/pdf/1210.6636v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.SE","categories":"cs.SE"} {"id":"1303.4629v1","submitted":"2013-03-19 14:59:02","updated":"2013-03-19 14:59:02","title":"The role of hidden influentials in the diffusion of online information\n cascades","abstract":" In a diversified context with multiple social networking sites, heterogeneous\nactivity patterns and different user-user relations, the concept of\n\"information cascade\" is all but univocal. Despite the fact that such\ninformation cascades can be defined in different ways, it is important to check\nwhether some of the observed patterns are common to diverse contagion processes\nthat take place on modern social media. Here, we explore one type of\ninformation cascades, namely, those that are time-constrained, related to two\nkinds of socially-rooted topics on Twitter. Specifically, we show that in both\ncases cascades sizes distribute following a fat tailed distribution and that\nwhether or not a cascade reaches system-wide proportions is mainly given by the\npresence of so-called hidden influentials. These latter nodes are not the hubs,\nwhich on the contrary, often act as firewalls for information spreading. 
Our\nresults are important for a better understanding of the dynamics of complex\ncontagion and, from a practical side, for the identification of efficient\nspreaders in viral phenomena.\n","authors":"Raquel A Baños|Javier Borge-Holthoefer|Yamir Moreno","affiliations":"","link_abstract":"http://arxiv.org/abs/1303.4629v1","link_pdf":"http://arxiv.org/pdf/1303.4629v1","link_doi":"","comment":"Submitted to EPJ Data Science","journal_ref":"","doi":"","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.SI"} {"id":"1304.0412v1","submitted":"2013-04-01 18:39:32","updated":"2013-04-01 18:39:32","title":"The FuturICT Education Accelerator","abstract":" Education is a major force for economic and social wellbeing. Despite high\naspirations, education at all levels can be expensive and ineffective. Three\nGrand Challenges are identified: (1) enable people to learn orders of magnitude\nmore effectively, (2) enable people to learn at orders of magnitude less cost,\nand (3) demonstrate success by exemplary interdisciplinary education in complex\nsystems science. A ten year `man-on-the-moon' project is proposed in which\nFuturICT's unique combination of Complexity, Social and Computing Sciences\ncould provide an urgently needed transdisciplinary language for making sense of\neducational systems. In close dialogue with educational theory and practice,\nand grounded in the emerging data science and learning analytics paradigms,\nthis will translate into practical tools (both analytical and computational)\nfor researchers, practitioners and leaders; generative principles for resilient\neducational ecosystems; and innovation for radically scalable, yet\npersonalised, learner engagement and assessment. The proposed {\\em Education\nAccelerator} will serve as a `wind tunnel' for testing these ideas in the\ncontext of real educational programmes, with an international virtual campus\ndelivering complex systems education exploiting the new understanding of\ncomplex, social, computationally enhanced organisational structure developed\nwithin FuturICT.\n","authors":"Jeffrey Johnson|Simon Buckingham Shum|Alistair Willis|Steven Bishop|Theodore Zamenopoulos|Stephen Swithenby|Robert MacKay|Yasmin Merali|Andras Lorincz|Carmen Costea|Paul Bourgine|Jorge Loucas Atis Kapenieks|Paul Kelley|Sally Caird|Jane Bromley|Ruth Deakin Crick|Chris Goldspink|Pierre Collet|Anna Carbone|Dirk Helbing","affiliations":"","link_abstract":"http://arxiv.org/abs/1304.0412v1","link_pdf":"http://arxiv.org/pdf/1304.0412v1","link_doi":"http://dx.doi.org/10.1140/epjst/e2012-01693-0","comment":"","journal_ref":"European Physical Journal-Special Topics, vol. 214, pp 215-243\n (2012)","doi":"10.1140/epjst/e2012-01693-0","primary_category":"physics.ed-ph","categories":"physics.ed-ph|physics.soc-ph"} {"id":"1304.1903v1","submitted":"2013-04-06 15:25:52","updated":"2013-04-06 15:25:52","title":"Towards a living earth simulator","abstract":" The Living Earth Simulator (LES) is one of the core components of the\nFuturICT architecture. It will work as a federation of methods, tools,\ntechniques and facilities supporting all of the FuturICT simulation-related\nactivities to allow and encourage interactive exploration and understanding of\nsocietal issues. Society-relevant problems will be targeted by leaning on\napproaches based on complex systems theories and data science in tight\ninteraction with the other components of FuturICT. The LES will evaluate and\nprovide answers to real-world questions by taking into account multiple\nscenarios. 
It will build on present approaches such as agent-based simulation\nand modeling, multiscale modelling, statistical inference, and data mining,\nmoving beyond disciplinary borders to achieve a new perspective on complex\nsocial systems.\n","authors":"M. Paolucci|D. Kossman|R. Conte|P. Lukowicz|P. Argyrakis|A. Blandford|G. Bonelli|S. Anderson|S. de Freitas|B. Edmonds|N. Gilbert|M. Gross|J. Kohlhammer|P. Koumoutsakos|A. Krause|B. -O. Linnér|P. Slusallek|O. Sorkine|R. W. Sumner|D. Helbing","affiliations":"","link_abstract":"http://arxiv.org/abs/1304.1903v1","link_pdf":"http://arxiv.org/pdf/1304.1903v1","link_doi":"http://dx.doi.org/10.1140/epjst/e2012-01689-8","comment":"","journal_ref":"Eur. Phys. J. Special Topics vol. 214, pp. 77-108 (2012)","doi":"10.1140/epjst/e2012-01689-8","primary_category":"physics.comp-ph","categories":"physics.comp-ph|cs.SI|physics.soc-ph"} {"id":"1308.0239v3","submitted":"2013-08-01 15:24:15","updated":"2014-08-14 14:12:25","title":"Modeling the Rise in Internet-based Petitions","abstract":" Contemporary collective action, much of which involves social media and other\nInternet-based platforms, leaves a digital imprint which may be harvested to\nbetter understand the dynamics of mobilization. Petition signing is an example\nof collective action which has gained in popularity with rising use of social\nmedia and provides such data for the whole population of petition signatories\nfor a given platform. This paper tracks the growth curves of all 20,000\npetitions to the UK government over 18 months, analyzing the rate of growth and\noutreach mechanism. Previous research has suggested the importance of the first\nday to the ultimate success of a petition, but has not examined early growth\nwithin that day, made possible here through hourly resolution in the data. The\nanalysis shows that the vast majority of petitions do not achieve any measure\nof success; over 99 percent fail to get the 10,000 signatures required for an\nofficial response and only 0.1 percent attain the 100,000 required for a\nparliamentary debate. We analyze the data through a multiplicative process\nmodel framework to explain the heterogeneous growth of signatures at the\npopulation level. We define and measure an average outreach factor for\npetitions and show that it decays very fast (reducing to 0.1% after 10 hours).\nAfter 24 hours, a petition's fate is virtually set. The findings seem to\nchallenge conventional analyses of collective action from economics and\npolitical science, where the production function has been assumed to follow an\nS-shaped curve.\n","authors":"Taha Yasseri|Scott A. Hale|Helen Margetts","affiliations":"","link_abstract":"http://arxiv.org/abs/1308.0239v3","link_pdf":"http://arxiv.org/pdf/1308.0239v3","link_doi":"","comment":"Submitted to EPJ Data Science","journal_ref":"","doi":"","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.CY|cs.HC|cs.SI|physics.data-an"} {"id":"1308.0309v2","submitted":"2013-08-01 19:29:28","updated":"2014-11-04 11:32:18","title":"Fast filtering and animation of large dynamic networks","abstract":" Detecting and visualizing what are the most relevant changes in an evolving\nnetwork is an open challenge in several domains. We present a fast algorithm\nthat filters subsets of the strongest nodes and edges representing an evolving\nweighted graph and visualize it by either creating a movie, or by streaming it\nto an interactive network visualization tool. 
The algorithm is an approximation\nof exponential sliding time-window that scales linearly with the number of\ninteractions. We compare the algorithm against rectangular and exponential\nsliding time-window methods. Our network filtering algorithm: i) captures\npersistent trends in the structure of dynamic weighted networks, ii) smoothens\ntransitions between the snapshots of dynamic network, and iii) uses limited\nmemory and processor time. The algorithm is publicly available as open-source\nsoftware.\n","authors":"Przemyslaw A. Grabowicz|Luca Maria Aiello|Filippo Menczer","affiliations":"","link_abstract":"http://arxiv.org/abs/1308.0309v2","link_pdf":"http://arxiv.org/pdf/1308.0309v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-014-0027-8","comment":"6 figures, 2 tables","journal_ref":"EPJ Data Science, Volume 3, Issue 1, 2014","doi":"10.1140/epjds/s13688-014-0027-8","primary_category":"cs.SI","categories":"cs.SI|cs.CY|physics.soc-ph"} {"id":"1308.0641v1","submitted":"2013-08-02 23:54:44","updated":"2013-08-02 23:54:44","title":"United Statistical Algorithm, Small and Big Data: Future OF Statistician","abstract":" This article provides the role of big idea statisticians in future of Big\nData Science. We describe the `United Statistical Algorithms' framework for\ncomprehensive unification of traditional and novel statistical methods for\nmodeling Small Data and Big Data, especially mixed data (discrete, continuous).\n","authors":"Emanuel Parzen|Subhadeep Mukhopadhyay","affiliations":"","link_abstract":"http://arxiv.org/abs/1308.0641v1","link_pdf":"http://arxiv.org/pdf/1308.0641v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.ME|stat.ML|stat.TH"} {"id":"1308.3603v2","submitted":"2013-08-16 11:14:57","updated":"2014-02-24 13:48:27","title":"The geography and carbon footprint of mobile phone use in Cote d'Ivoire","abstract":" The newly released Orange D4D mobile phone data base provides new insights\ninto the use of mobile technology in a developing country. Here we perform a\nseries of spatial data analyses that reveal important geographic aspects of\nmobile phone use in Cote d'Ivoire. We first map the locations of base stations\nwith respect to the population distribution and the number and duration of\ncalls at each base station. On this basis, we estimate the energy consumed by\nthe mobile phone network. Finally, we perform an analysis of inter-city\nmobility, and identify high-traffic roads in the country.\n","authors":"Vsevolod Salnikov|Daniel Schien|Hyejin Youn|Renaud Lambiotte|Michael T. Gastner","affiliations":"","link_abstract":"http://arxiv.org/abs/1308.3603v2","link_pdf":"http://arxiv.org/pdf/1308.3603v2","link_doi":"http://dx.doi.org/10.1140/epjds21","comment":"23 pages, 8 figures, 1 table","journal_ref":"EPJ Data Science 2014, 3:3","doi":"10.1140/epjds21","primary_category":"cs.CY","categories":"cs.CY|physics.soc-ph"} {"id":"1308.6823v1","submitted":"2013-08-30 19:30:44","updated":"2013-08-30 19:30:44","title":"A Hypergraph-Partitioned Vertex Programming Approach for Large-scale\n Consensus Optimization","abstract":" In modern data science problems, techniques for extracting value from big\ndata require performing large-scale optimization over heterogenous, irregularly\nstructured data. Much of this data is best represented as multi-relational\ngraphs, making vertex programming abstractions such as those of Pregel and\nGraphLab ideal fits for modern large-scale data analysis. 
In this paper, we\ndescribe a vertex-programming implementation of a popular consensus\noptimization technique known as the alternating direction of multipliers\n(ADMM). ADMM consensus optimization allows elegant solution of complex\nobjectives such as inference in rich probabilistic models. We also introduce a\nnovel hypergraph partitioning technique that improves over state-of-the-art\npartitioning techniques for vertex programming and significantly reduces the\ncommunication cost by reducing the number of replicated nodes up to an order of\nmagnitude. We implemented our algorithm in GraphLab and measure scaling\nperformance on a variety of realistic bipartite graph distributions and a large\nsynthetic voter-opinion analysis application. In our experiments, we are able\nto achieve a 50% improvement in runtime over the current state-of-the-art\nGraphLab partitioning scheme.\n","authors":"Hui Miao|Xiangyang Liu|Bert Huang|Lise Getoor","affiliations":"","link_abstract":"http://arxiv.org/abs/1308.6823v1","link_pdf":"http://arxiv.org/pdf/1308.6823v1","link_doi":"http://dx.doi.org/10.1109/BigData.2013.6691623","comment":"","journal_ref":"","doi":"10.1109/BigData.2013.6691623","primary_category":"cs.AI","categories":"cs.AI|cs.DC"} {"id":"1309.2895v5","submitted":"2013-09-11 17:18:30","updated":"2019-08-19 20:18:01","title":"Sparse and Functional Principal Components Analysis","abstract":" Regularized variants of Principal Components Analysis, especially Sparse PCA\nand Functional PCA, are among the most useful tools for the analysis of complex\nhigh-dimensional data. Many examples of massive data, have both sparse and\nfunctional (smooth) aspects and may benefit from a regularization scheme that\ncan capture both forms of structure. For example, in neuro-imaging data, the\nbrain's response to a stimulus may be restricted to a discrete region of\nactivation (spatial sparsity), while exhibiting a smooth response within that\nregion. We propose a unified approach to regularized PCA which can induce both\nsparsity and smoothness in both the row and column principal components. Our\nframework generalizes much of the previous literature, with sparse, functional,\ntwo-way sparse, and two-way functional PCA all being special cases of our\napproach. Our method permits flexible combinations of sparsity and smoothness\nthat lead to improvements in feature selection and signal recovery, as well as\nmore interpretable PCA factors. We demonstrate the efficacy of our method on\nsimulated data and a neuroimaging example on EEG data.\n","authors":"Genevera I. Allen|Michael Weylandt","affiliations":"","link_abstract":"http://arxiv.org/abs/1309.2895v5","link_pdf":"http://arxiv.org/pdf/1309.2895v5","link_doi":"http://dx.doi.org/10.1109/DSW.2019.8755778","comment":"The published version of this paper incorrectly thanks \"Luofeng Luo\"\n instead of \"Luofeng Liao\" in the Acknowledgements","journal_ref":"DSW 2019: Proceedings of the IEEE Data Science Workshop 2019, pp.\n 11-16","doi":"10.1109/DSW.2019.8755778","primary_category":"stat.ML","categories":"stat.ML"} {"id":"1309.7824v3","submitted":"2013-09-30 12:48:35","updated":"2019-12-12 23:47:00","title":"Linear Regression from Strategic Data Sources","abstract":" Linear regression is a fundamental building block of statistical data\nanalysis. It amounts to estimating the parameters of a linear model that maps\ninput features to corresponding outputs. 
In the classical setting where the\nprecision of each data point is fixed, the famous Aitken/Gauss-Markov theorem\nin statistics states that generalized least squares (GLS) is a so-called \"Best\nLinear Unbiased Estimator\" (BLUE). In modern data science, however, one often\nfaces strategic data sources, namely, individuals who incur a cost for\nproviding high-precision data.\n In this paper, we study a setting in which features are public but\nindividuals choose the precision of the outputs they reveal to an analyst. We\nassume that the analyst performs linear regression on this dataset, and\nindividuals benefit from the outcome of this estimation. We model this scenario\nas a game where individuals minimize a cost comprising two components: (a) an\n(agent-specific) disclosure cost for providing high-precision data; and (b) a\n(global) estimation cost representing the inaccuracy in the linear model\nestimate. In this game, the linear model estimate is a public good that\nbenefits all individuals. We establish that this game has a unique non-trivial\nNash equilibrium. We study the efficiency of this equilibrium and we prove\ntight bounds on the price of stability for a large class of disclosure and\nestimation costs. Finally, we study the estimator accuracy achieved at\nequilibrium. We show that, in general, Aitken's theorem does not hold under\nstrategic data sources, though it does hold if individuals have identical\ndisclosure costs (up to a multiplicative factor). When individuals have\nnon-identical costs, we derive a bound on the improvement of the equilibrium\nestimation cost that can be achieved by deviating from GLS, under mild\nassumptions on the disclosure cost functions.\n","authors":"Nicolas Gast|Stratis Ioannidis|Patrick Loiseau|Benjamin Roussillon","affiliations":"","link_abstract":"http://arxiv.org/abs/1309.7824v3","link_pdf":"http://arxiv.org/pdf/1309.7824v3","link_doi":"","comment":"This version (v3) extends the results on the sub-optimality of GLS\n (Section 6) and improves writing in multiple places compared to v2. Compared\n to the initial version v1, it also fixes an error in Theorem 6 (now Theorem\n 5), and extended many of the results","journal_ref":"","doi":"","primary_category":"cs.GT","categories":"cs.GT|cs.LG|math.ST|stat.TH"} {"id":"1310.4461v2","submitted":"2013-10-16 17:44:35","updated":"2014-03-20 21:27:29","title":"Scoring dynamics across professional team sports: tempo, balance and\n predictability","abstract":" Despite growing interest in quantifying and modeling the scoring dynamics\nwithin professional sports games, relative little is known about what patterns\nor principles, if any, cut across different sports. Using a comprehensive data\nset of scoring events in nearly a dozen consecutive seasons of college and\nprofessional (American) football, professional hockey, and professional\nbasketball, we identify several common patterns in scoring dynamics. Across\nthese sports, scoring tempo---when scoring events occur---closely follows a\ncommon Poisson process, with a sport-specific rate. Similarly, scoring\nbalance---how often a team wins an event---follows a common Bernoulli process,\nwith a parameter that effectively varies with the size of the lead. Combining\nthese processes within a generative model of gameplay, we find they both\nreproduce the observed dynamics in all four sports and accurately predict game\noutcomes. 
These results demonstrate common dynamical patterns underlying\nwithin-game scoring dynamics across professional team sports, and suggest\nspecific mechanisms for driving them. We close with a brief discussion of the\nimplications of our results for several popular hypotheses about sports\ndynamics.\n","authors":"Sears Merritt|Aaron Clauset","affiliations":"","link_abstract":"http://arxiv.org/abs/1310.4461v2","link_pdf":"http://arxiv.org/pdf/1310.4461v2","link_doi":"http://dx.doi.org/10.1140/epjds29","comment":"18 pages, 8 figures, 4 tables, 2 appendices","journal_ref":"EPJ Data Science 3, 4 (2014)","doi":"10.1140/epjds29","primary_category":"stat.AP","categories":"stat.AP|cs.CY|physics.data-an|physics.soc-ph"} {"id":"1310.7505v1","submitted":"2013-10-28 17:33:22","updated":"2013-10-28 17:33:22","title":"Quantifying age- and gender-related diabetes comorbidity risks using\n nation-wide big claims data","abstract":" Currently emerging \"big data\" techniques are reshaping medical science into a\ndata science. Medical claims data allow assessing an entire nation's health\nstate in a quantitative way, in particular with regard to the occurrences and\nconsequences of chronic and pandemic diseases like diabetes.\n We develop a quantitative, statistical approach to test for associations\nbetween the incidence of type 1 or type 2 diabetes and any possible other\ndisease as provided by the ICD10 diagnosis codes using a complete set of\nAustrian inpatient data. With a new co-occurrence analysis the relative risks\nfor each possible comorbidity are studied as a function of patient age and\ngender, a temporal analysis investigates whether the onset of diabetes\ntypically precedes or follows the onset of the other disease. The samples is\nalways of maximal size, i.e. contains all patients with that comorbidity within\nthe country. The present study is an equivalent of almost 40,000 studies, all\nwith maximum patient number available. Out of more than thousand possible\nassociations, 123 comorbid diseases for type 1 or type 2 diabetes are\nidentified at high significance levels.\n Well known diabetic comorbidities are recovered, such as retinopathies,\nhypertension, chronic kidney diseases, etc. This validates the method.\nAdditionally, a number of comorbidities are identified which have only been\nrecognized to a lesser extent, for example epilepsy, sepsis, or mental\ndisorders. The temporal evolution, age, and gender-dependence of these\ncomorbidities are discussed. The new statistical-network methodology developed\nhere can be readily applied to other chronic diseases.\n","authors":"Peter Klimek|Alexandra Kautzky-Willer|Anna Chmiel|Irmgard Schiller-Früwirth|Stefan Thurner","affiliations":"","link_abstract":"http://arxiv.org/abs/1310.7505v1","link_pdf":"http://arxiv.org/pdf/1310.7505v1","link_doi":"","comment":"9 pages, 3 figures","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP"} {"id":"1310.8508v2","submitted":"2013-10-31 14:07:35","updated":"2013-12-10 15:11:32","title":"The distorted mirror of Wikipedia: a quantitative analysis of Wikipedia\n coverage of academics","abstract":" Activity of modern scholarship creates online footprints galore. Along with\ntraditional metrics of research quality, such as citation counts, online images\nof researchers and institutions increasingly matter in evaluating academic\nimpact, decisions about grant allocation, and promotion. 
We examined 400\nbiographical Wikipedia articles on academics from four scientific fields to\ntest if being featured in the world's largest online encyclopedia is correlated\nwith higher academic notability (assessed through citation counts). We found no\nstatistically significant correlation between Wikipedia articles metrics\n(length, number of edits, number of incoming links from other articles, etc.)\nand academic notability of the mentioned researchers. We also did not find any\nevidence that the scientists with better WP representation are necessarily more\nprominent in their fields. In addition, we inspected the Wikipedia coverage of\nnotable scientists sampled from Thomson Reuters list of \"highly cited\nresearchers\". In each of the examined fields, Wikipedia failed in covering\nnotable scholars properly. Both findings imply that Wikipedia might be\nproducing an inaccurate image of academics on the front end of science. By\nshedding light on how public perception of academic progress is formed, this\nstudy alerts that a subjective element might have been introduced into the\nhitherto structured system of academic evaluation.\n","authors":"Anna Samoilenko|Taha Yasseri","affiliations":"","link_abstract":"http://arxiv.org/abs/1310.8508v2","link_pdf":"http://arxiv.org/pdf/1310.8508v2","link_doi":"http://dx.doi.org/10.1140/epjds20","comment":"To appear in EPJ Data Science. To have the Additional Files and\n Datasets e-mail the corresponding author","journal_ref":"EPJ Data Science 2014, 3:1","doi":"10.1140/epjds20","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.CY|cs.DL|cs.SI|physics.data-an"} {"id":"1311.0562v2","submitted":"2013-11-04 01:56:04","updated":"2013-11-06 13:44:49","title":"LP Mixed Data Science : Outline of Theory","abstract":" This article presents the theoretical foundation of a new frontier of\nresearch-`LP Mixed Data Science'-that simultaneously extends and integrates the\npractice of traditional and novel statistical methods for nonparametric\nexploratory data modeling, and is applicable to the teaching and training of\nstatistics.\n Statistics journals have great difficulty accepting papers unlike those\npreviously published. For statisticians with new big ideas a practical strategy\nis to publish them in many small applied studies which enables one to provide\nreferences to work of others. This essay outlines the many concepts, new\ntheory, and important algorithms of our new culture of statistical science\ncalled LP MIXED DATA SCIENCE. It provides comprehensive solutions to problems\nof data analysis and nonparametric modeling of many variables that are\ncontinuous or discrete, which does not yet have a large literature. It develops\na new modeling approach to nonparametric estimation of the multivariate copula\ndensity. 
We discuss the theory which we believe is very elegant (and can\nprovide a framework for United Statistical Algorithms, for traditional Small\nData methods and Big Data methods).\n","authors":"Emanuel Parzen|Subhadeep Mukhopadhyay","affiliations":"","link_abstract":"http://arxiv.org/abs/1311.0562v2","link_pdf":"http://arxiv.org/pdf/1311.0562v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.ME|stat.TH"} {"id":"1311.6063v5","submitted":"2013-11-23 22:39:52","updated":"2019-07-16 14:12:22","title":"NILE: Fast Natural Language Processing for Electronic Health Records","abstract":" Objective: Narrative text in Electronic health records (EHR) contain rich\ninformation for medical and data science studies. This paper introduces the\ndesign and performance of Narrative Information Linear Extraction (NILE), a\nnatural language processing (NLP) package for EHR analysis that we share with\nthe medical informatics community. Methods: NILE uses a modified prefix-tree\nsearch algorithm for named entity recognition, which can detect prefix and\nsuffix sharing. The semantic analyses are implemented as rule-based finite\nstate machines. Analyses include negation, location, modification, family\nhistory, and ignoring. Result: The processing speed of NILE is hundreds to\nthousands times faster than existing NLP software for medical text. The\naccuracy of presence analysis of NILE is on par with the best performing models\non the 2010 i2b2/VA NLP challenge data. Conclusion: The speed, accuracy, and\nbeing able to operate via API make NILE a valuable addition to the NLP software\nfor medical informatics and data science.\n","authors":"Sheng Yu|Tianrun Cai|Tianxi Cai","affiliations":"","link_abstract":"http://arxiv.org/abs/1311.6063v5","link_pdf":"http://arxiv.org/pdf/1311.6063v5","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CL","categories":"cs.CL"} {"id":"1401.3269v1","submitted":"2014-01-14 17:43:11","updated":"2014-01-14 17:43:11","title":"Teaching precursors to data science in introductory and second courses\n in statistics","abstract":" Statistics students need to develop the capacity to make sense of the\nstaggering amount of information collected in our increasingly data-centered\nworld. Data science is an important part of modern statistics, but our\nintroductory and second statistics courses often neglect this fact. This paper\ndiscusses ways to provide a practical foundation for students to learn to\n\"compute with data\" as defined by Nolan and Temple Lang (2010), as well as\ndevelop \"data habits of mind\" (Finzer, 2013). We describe how introductory and\nsecond courses can integrate two key precursors to data science: the use of\nreproducible analysis tools and access to large databases. 
By introducing\nstudents to commonplace tools for data management, visualization, and\nreproducible analysis in data science and applying these to real-world\nscenarios, we prepare them to think statistically in the era of big data.\n","authors":"Nicholas J Horton|Benjamin S Baumer|Hadley Wickham","affiliations":"","link_abstract":"http://arxiv.org/abs/1401.3269v1","link_pdf":"http://arxiv.org/pdf/1401.3269v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.CO","categories":"stat.CO|cs.CY|stat.OT|62-07"} {"id":"1401.6157v2","submitted":"2014-01-23 20:42:11","updated":"2014-12-10 01:20:41","title":"Exploiting citation networks for large-scale author name disambiguation","abstract":" We present a novel algorithm and validation method for disambiguating author\nnames in very large bibliographic data sets and apply it to the full Web of\nScience (WoS) citation index. Our algorithm relies only upon the author and\ncitation graphs available for the whole period covered by the WoS. A pair-wise\npublication similarity metric, which is based on common co-authors,\nself-citations, shared references and citations, is established to perform a\ntwo-step agglomerative clustering that first connects individual papers and\nthen merges similar clusters. This parameterized model is optimized using an\nh-index based recall measure, favoring the correct assignment of well-cited\npublications, and a name-initials-based precision using WoS metadata and\ncross-referenced Google Scholar profiles. Despite the use of limited metadata,\nwe reach a recall of 87% and a precision of 88% with a preference for\nresearchers with high h-index values. 47 million articles of WoS can be\ndisambiguated on a single machine in less than a day. We develop an h-index\ndistribution model, confirming that the prediction is in excellent agreement\nwith the empirical data, and yielding insight into the utility of the h-index\nin real academic ranking scenarios.\n","authors":"Christian Schulz|Amin Mazloumian|Alexander M Petersen|Orion Penner|Dirk Helbing","affiliations":"","link_abstract":"http://arxiv.org/abs/1401.6157v2","link_pdf":"http://arxiv.org/pdf/1401.6157v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-014-0011-3","comment":"14 pages, 5 figures","journal_ref":"EPJ Data Science 2014, 3:11","doi":"10.1140/epjds/s13688-014-0011-3","primary_category":"cs.DL","categories":"cs.DL|cs.SI|physics.soc-ph"} {"id":"1402.0459v1","submitted":"2014-02-03 18:47:41","updated":"2014-02-03 18:47:41","title":"Applying Supervised Learning Algorithms and a New Feature Selection\n Method to Predict Coronary Artery Disease","abstract":" From a fresh data science perspective, this thesis discusses the prediction\nof coronary artery disease based on genetic variations at the DNA base pair\nlevel, called Single-Nucleotide Polymorphisms (SNPs), collected from the\nOntario Heart Genomics Study (OHGS).\n First, the thesis explains two commonly used supervised learning algorithms,\nthe k-Nearest Neighbour (k-NN) and Random Forest classifiers, and includes a\ncomplete proof that the k-NN classifier is universally consistent in any finite\ndimensional normed vector space. 
Second, the thesis introduces two\ndimensionality reduction steps, Random Projections, a known feature extraction\ntechnique based on the Johnson-Lindenstrauss lemma, and a new method termed\nMass Transportation Distance (MTD) Feature Selection for discrete domains.\nThen, this thesis compares the performance of Random Projections with the k-NN\nclassifier against MTD Feature Selection and Random Forest, for predicting\nartery disease based on accuracy, the F-Measure, and area under the Receiver\nOperating Characteristic (ROC) curve.\n The comparative results demonstrate that MTD Feature Selection with Random\nForest is vastly superior to Random Projections and k-NN. The Random Forest\nclassifier is able to obtain an accuracy of 0.6660 and an area under the ROC\ncurve of 0.8562 on the OHGS genetic dataset, when 3335 SNPs are selected by MTD\nFeature Selection for classification. This area is considerably better than the\nprevious high score of 0.608 obtained by Davies et al. in 2010 on the same\ndataset.\n","authors":"Hubert Haoyang Duan","affiliations":"","link_abstract":"http://arxiv.org/abs/1402.0459v1","link_pdf":"http://arxiv.org/pdf/1402.0459v1","link_doi":"","comment":"This is a Master of Science in Mathematics thesis under the\n supervision of Dr. Vladimir Pestov and Dr. George Wells submitted on January\n 31, 2014 at the University of Ottawa; 102 pages and 15 figures","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1402.3488v2","submitted":"2014-02-14 15:10:16","updated":"2015-09-17 14:17:51","title":"A Unifying Model for Representing Time-Varying Graphs","abstract":" Graph-based models form a fundamental aspect of data representation in Data\nSciences and play a key role in modeling complex networked systems. In\nparticular, recently there is an ever-increasing interest in modeling dynamic\ncomplex networks, i.e. networks in which the topological structure (nodes and\nedges) may vary over time. In this context, we propose a novel model for\nrepresenting finite discrete Time-Varying Graphs (TVGs), which are typically\nused to model dynamic complex networked systems. We analyze the data structures\nbuilt from our proposed model and demonstrate that, for most practical cases,\nthe asymptotic memory complexity of our model is in the order of the\ncardinality of the set of edges. Further, we show that our proposal is an\nunifying model that can represent several previous (classes of) models for\ndynamic networks found in the recent literature, which in general are unable to\nrepresent each other. In contrast to previous models, our proposal is also able\nto intrinsically model cyclic (i.e. periodic) behavior in dynamic networks.\nThese representation capabilities attest the expressive power of our proposed\nunifying model for TVGs. We thus believe our unifying model for TVGs is a step\nforward in the theoretical foundations for data analysis of complex networked\nsystems.\n","authors":"Klaus Wehmuth|Artur Ziviani|Eric Fleury","affiliations":"LNCC / MCTI|LNCC / MCTI|ENS de Lyon / INRIA - Université de Lyon","link_abstract":"http://arxiv.org/abs/1402.3488v2","link_pdf":"http://arxiv.org/pdf/1402.3488v2","link_doi":"","comment":"Also appears in the Proc. 
of the IEEE International Conference on\n Data Science and Advanced Analytics (IEEE DSAA'2015)","journal_ref":"","doi":"","primary_category":"cs.DS","categories":"cs.DS|cs.DM|cs.SI"} {"id":"1402.5593v1","submitted":"2014-02-23 10:07:59","updated":"2014-02-23 10:07:59","title":"Reciprocity in Gift-Exchange-Games","abstract":" This paper presents an analysis of data from a gift-exchange-game experiment.\nThe experiment was described in `The Impact of Social Comparisons on\nReciprocity' by G\\\"achter et al. 2012. Since this paper uses state-of-art data\nscience techniques, the results provide a different point of view on the\nproblem. As already shown in relevant literature from experimental economics,\nhuman decisions deviate from rational payoff maximization. The average gift\nrate was $31$%. Gift rate was under no conditions zero. Further, we derive some\nspecial findings and calculate their significance.\n","authors":"Rustam Tagiew|Dmitry I. Ignatov","affiliations":"","link_abstract":"http://arxiv.org/abs/1402.5593v1","link_pdf":"http://arxiv.org/pdf/1402.5593v1","link_doi":"","comment":"6 pages, 2 figures, 5 tables","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI"} {"id":"1402.5932v1","submitted":"2014-02-24 20:07:18","updated":"2014-02-24 20:07:18","title":"A New Framework for a Model-Based Data Science Computational Platform","abstract":" Astronomy produces extremely large data sets from ground-based telescopes,\nspace missions, and simulation. The volume and complexity of these rich data\nsets require new approaches and advanced tools to understand the information\ncontained therein. No one can load this data on their own computer, most cannot\neven keep it at their institution, and worse, no platform exists that allows\none to evaluate their models across the whole of the data. Simply having an\nextremely large volume of data available in one place is not sufficient; one\nmust be able to make valid, rigorous, scientific comparisons across very\ndifferent data sets from very different instrumentation. We propose a framework\nto directly address this which has the following components: a model-based\ncomputational platform, streamlined access to large volumes of data, and an\neducational and social platform for both researchers and the public.\n","authors":"Demitri Muna|Eric Huff","affiliations":"","link_abstract":"http://arxiv.org/abs/1402.5932v1","link_pdf":"http://arxiv.org/pdf/1402.5932v1","link_doi":"","comment":"submitted to Astronomy and Computing","journal_ref":"","doi":"","primary_category":"astro-ph.IM","categories":"astro-ph.IM"} {"id":"1403.3568v2","submitted":"2014-03-14 13:16:27","updated":"2014-06-15 00:13:59","title":"Modeling Social Dynamics in a Collaborative Environment","abstract":" Wikipedia is a prime example of today's value production in a collaborative\nenvironment. Using this example, we model the emergence, persistence and\nresolution of severe conflicts during collaboration by coupling opinion\nformation with article editing in a bounded confidence dynamics. The complex\nsocial behavior involved in editing articles is implemented as a minimal model\nwith two basic elements; (i) individuals interact directly to share information\nand convince each other, and (ii) they edit a common medium to establish their\nown opinions. Opinions of the editors and that represented by the article are\ncharacterised by a scalar variable. 
When the pool of editors is fixed, three\nregimes can be distinguished: (a) a stable mainstream article opinion is\ncontinuously contested by editors with extremist views and there is slow\nconvergence towards consensus, (b) the article oscillates between editors with\nextremist views, reaching consensus relatively fast at one of the extremes, and\n(c) the extremist editors are converted very fast to the mainstream opinion and\nthe article has an erratic evolution. When editors are renewed with a certain\nrate, a dynamical transition occurs between different kinds of edit wars, which\nqualitatively reflect the dynamics of conflicts as observed in real Wikipedia\ndata.\n","authors":"Gerardo Iñiguez|János Török|Taha Yasseri|Kimmo Kaski|János Kertész","affiliations":"","link_abstract":"http://arxiv.org/abs/1403.3568v2","link_pdf":"http://arxiv.org/pdf/1403.3568v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-014-0007-z","comment":"Revised version, to appear in EPJ Data Science; 19 pages 9 figures","journal_ref":"EPJ Data Science 3 (1), 7 (2014)","doi":"10.1140/epjds/s13688-014-0007-z","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.CY|cs.SI|physics.data-an"} {"id":"1404.5971v2","submitted":"2014-04-23 20:27:58","updated":"2014-06-09 19:41:44","title":"Mechanism Design for Data Science","abstract":" Good economic mechanisms depend on the preferences of participants in the\nmechanism. For example, the revenue-optimal auction for selling an item is\nparameterized by a reserve price, and the appropriate reserve price depends on\nhow much the bidders are willing to pay. A mechanism designer can potentially\nlearn about the participants' preferences by observing historical data from the\nmechanism; the designer could then update the mechanism in response to learned\npreferences to improve its performance. The challenge of such an approach is\nthat the data corresponds to the actions of the participants and not their\npreferences. Preferences can potentially be inferred from actions but the\ndegree of inference possible depends on the mechanism. In the optimal auction\nexample, it is impossible to learn anything about preferences of bidders who\nare not willing to pay the reserve price. These bidders will not cast bids in\nthe auction and, from historical bid data, the auctioneer could never learn\nthat lowering the reserve price would give a higher revenue (even if it would).\nTo address this impossibility, the auctioneer could sacrifice revenue\noptimality in the initial auction to obtain better inference properties so that\nthe auction's parameters can be adapted to changing preferences in the future.\nThis paper develops the theory for optimal mechanism design subject to good\ninferability.\n","authors":"Shuchi Chawla|Jason Hartline|Denis Nekipelov","affiliations":"","link_abstract":"http://arxiv.org/abs/1404.5971v2","link_pdf":"http://arxiv.org/pdf/1404.5971v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.GT","categories":"cs.GT|J.4"} {"id":"1405.2601v1","submitted":"2014-05-11 23:16:37","updated":"2014-05-11 23:16:37","title":"LP Approach to Statistical Modeling","abstract":" We present an approach to statistical data modeling and exploratory data\nanalysis called `LP Statistical Data Science.' 
It aims to generalize and unify\ntraditional and novel statistical measures, methods, and exploratory tools.\nThis article outlines fundamental concepts along with real-data examples to\nillustrate how the `LP Statistical Algorithm' can systematically tackle\ndifferent varieties of data types, data patterns, and data structures under a\ncoherent theoretical framework. A fundamental role is played by specially\ndesigned orthonormal basis of a random variable X for linear (Hilbert space\ntheory) representation of a general function of X, such as $\\mbox{E}[Y \\mid\nX]$.\n","authors":"Subhadeep Mukhopadhyay|Emanuel Parzen","affiliations":"","link_abstract":"http://arxiv.org/abs/1405.2601v1","link_pdf":"http://arxiv.org/pdf/1405.2601v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.ME|stat.TH"} {"id":"1405.2881v4","submitted":"2014-05-12 19:15:32","updated":"2015-08-08 17:20:57","title":"Consistency of random forests","abstract":" Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45\n(2001) 5--32] that combines several randomized decision trees and aggregates\ntheir predictions by averaging. Despite its wide usage and outstanding\npractical performance, little is known about the mathematical properties of the\nprocedure. This disparity between theory and practice originates in the\ndifficulty to simultaneously analyze both the randomization process and the\nhighly data-dependent tree structure. In the present paper, we take a step\nforward in forest exploration by proving a consistency result for Breiman's\n[Mach. Learn. 45 (2001) 5--32] original algorithm in the context of additive\nregression models. Our analysis also sheds an interesting light on how random\nforests can nicely adapt to sparsity. 1. Introduction. Random forests are an\nensemble learning method for classification and regression that constructs a\nnumber of randomized decision trees during the training phase and predicts by\naveraging the results. Since its publication in the seminal paper of Breiman\n(2001), the procedure has become a major data analysis tool, that performs well\nin practice in comparison with many standard methods. What has greatly\ncontributed to the popularity of forests is the fact that they can be applied\nto a wide range of prediction problems and have few parameters to tune. Aside\nfrom being simple to use, the method is generally recognized for its accuracy\nand its ability to deal with small sample sizes, high-dimensional feature\nspaces and complex data structures. The random forest methodology has been\nsuccessfully involved in many practical problems, including air quality\nprediction (winning code of the EMC data science global hackathon in 2012, see\nhttp://www.kaggle.com/c/dsg-hackathon), chemoinformatics [Svetnik et al.\n(2003)], ecology [Prasad, Iverson and Liaw (2006), Cutler et al. 
(2007)], 3D\n","authors":"Erwan Scornet|Gérard Biau|Jean-Philippe Vert","affiliations":"LSTA|LSTA, LPMA|CBIO","link_abstract":"http://arxiv.org/abs/1405.2881v4","link_pdf":"http://arxiv.org/pdf/1405.2881v4","link_doi":"http://dx.doi.org/10.1214/15-AOS1321","comment":"","journal_ref":"Annals of Statistics, Institute of Mathematical Statistics (IMS),\n 2015, 43 (4), pp.1716-1741","doi":"10.1214/15-AOS1321","primary_category":"math.ST","categories":"math.ST|stat.ML|stat.TH"} {"id":"1406.2015v1","submitted":"2014-06-08 19:19:45","updated":"2014-06-08 19:19:45","title":"MOOCdb: Developing Standards and Systems to Support MOOC Data Science","abstract":" We present a shared data model for enabling data science in Massive Open\nOnline Courses (MOOCs). The model captures students' interactions with the\nonline platform. The data model is platform agnostic and is based on some basic\ncore actions that students take on an online learning platform. Students\nusually interact with the platform in four different modes: Observing,\nSubmitting, Collaborating and giving feedback. In observing mode students are\nsimply browsing the online platform, watching videos, reading material, reading\nbooks or browsing forums. In submitting mode, students submit information to the\nplatform. This includes submissions for quizzes, homework, or any\nassessment modules. In collaborating mode students interact with other students\nor instructors on forums, collaboratively editing a wiki or chatting on Google\nHangouts or other venues. With these basic definitions of activities, and\na data model to store events pertaining to these activities, we then create a\ncommon terminology to map Coursera and edX data into this shared data model.\nThis shared data model, called MOOCdb, becomes the foundation for a number of\ncollaborative frameworks that enable progress in data science without the need\nto share the data.\n","authors":"Kalyan Veeramachaneni|Sherif Halawa|Franck Dernoncourt|Una-May O'Reilly|Colin Taylor|Chuong Do","affiliations":"","link_abstract":"http://arxiv.org/abs/1406.2015v1","link_pdf":"http://arxiv.org/pdf/1406.2015v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.IR","categories":"cs.IR|cs.CY|cs.DB"} {"id":"1407.5238v1","submitted":"2014-07-20 02:45:35","updated":"2014-07-20 02:45:35","title":"Towards Feature Engineering at Scale for Data from Massive Open Online\n Courses","abstract":" We examine the process of engineering features for developing models that\nimprove our understanding of learners' online behavior in MOOCs. Because\nfeature engineering relies so heavily on human insight, we argue that extra\neffort should be made to engage the crowd for feature proposals and even their\noperationalization. We show two approaches where we have started to engage the\ncrowd. We also show how features can be evaluated for their relevance in\npredictive accuracy. When we examined crowd-sourced features in the context of\npredicting stopout, not only were they nuanced, but they also considered more\nthan one interaction mode between the learner and platform and how the learner\nwas relatively performing. We were able to identify different influential\nfeatures for stop out prediction that depended on whether a learner was in 1 of\n4 cohorts defined by their level of engagement with the course discussion forum\nor wiki. 
This report is part of a compendium which considers different aspects\nof MOOC data science and stop out prediction.\n","authors":"Kalyan Veeramachaneni|Una-May O'Reilly|Colin Taylor","affiliations":"","link_abstract":"http://arxiv.org/abs/1407.5238v1","link_pdf":"http://arxiv.org/pdf/1407.5238v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1408.3170v1","submitted":"2014-08-14 00:21:59","updated":"2014-08-14 00:21:59","title":"The Value of Using Big Data Technologies in Computational Social Science","abstract":" The discovery of phenomena in social networks has prompted renewed interests\nin the field. Data in social networks however can be massive, requiring\nscalable Big Data architecture. Conversely, research in Big Data needs the\nvolume and velocity of social media data for testing its scalability. Not only\nso, appropriate data processing and mining of acquired datasets involve complex\nissues in the variety, veracity, and variability of the data, after which\nvisualisation must occur before we can see fruition in our efforts. This\narticle presents topical, multimodal, and longitudinal social media datasets\nfrom the integration of various scalable open source technologies. The article\ndetails the process that led to the discovery of social information landscapes\nwithin the Twitter social network, highlighting the experience of dealing with\nsocial media datasets, using a funneling approach so that data becomes\nmanageable. The article demonstrated the feasibility and value of using\nscalable open source technologies for acquiring massive, connected datasets for\nresearch in the social sciences.\n","authors":"Eugene Ch'ng","affiliations":"","link_abstract":"http://arxiv.org/abs/1408.3170v1","link_pdf":"http://arxiv.org/pdf/1408.3170v1","link_doi":"","comment":"3rd ASE Big Data Science Conference, Tsinghua University Beijing, 3-7\n August 2014","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI|physics.soc-ph|J.4; H.2.8"} {"id":"1408.5090v1","submitted":"2014-08-21 18:08:12","updated":"2014-08-21 18:08:12","title":"Spectroscopic accuracy directly from quantum chemistry: application to\n ground and excited states of beryllium dimer","abstract":" We combine explicit correlation via the canonical transcorrelation approach\nwith the density matrix renormalization group and initiator full configuration\ninteraction quantum Monte Carlo methods to compute a near-exact beryllium dimer\ncurve, {\\it without} the use of composite methods. In particular, our direct\ndensity matrix renormalization group calculations produce a well-depth of\n$D_e$=931.2 cm$^{-1}$ which agrees very well with recent experimentally derived\nestimates $D_e$=929.7$\\pm 2$~cm$^{-1}$ [Science, 324, 1548 (2009)] and\n$D_e$=934.6~cm$^{-1}$ [Science, 326, 1382 (2009)]], as well the best composite\ntheoretical estimates, $D_e$=938$\\pm 15$~cm$^{-1}$ [J. Phys. Chem. A, 111,\n12822 (2007)] and $D_e$=935.1$\\pm 10$~cm$^{-1}$ [Phys. Chem. Chem. Phys., 13,\n20311 (2011)]. Our results suggest possible inaccuracies in the functional form\nof the potential used at shorter bond lengths to fit the experimental data\n[Science, 324, 1548 (2009)]. With the density matrix renormalization group we\nalso compute near-exact vertical excitation energies at the equilibrium\ngeometry. 
These provide non-trivial benchmarks for quantum chemical methods for\nexcited states, and illustrate the surprisingly large error that remains for\n1$^1\\Sigma^-_g$ state with approximate multi-reference configuration\ninteraction and equation-of-motion coupled cluster methods. Overall, we\ndemonstrate that explicitly correlated density matrix renormalization group and\ninitiator full configuration interaction quantum Monte Carlo methods allow us\nto fully converge to the basis set and correlation limit of the\nnon-relativistic Schr\\\"odinger equation in small molecules.\n","authors":"Sandeep Sharma|Takeshi Yanai|George H. Booth|C. J. Umrigar|Garnet Kin-Lic Chan","affiliations":"","link_abstract":"http://arxiv.org/abs/1408.5090v1","link_pdf":"http://arxiv.org/pdf/1408.5090v1","link_doi":"http://dx.doi.org/10.1063/1.4867383","comment":"","journal_ref":"Journal of Chemical Physics 140, 104112 (2014)","doi":"10.1063/1.4867383","primary_category":"physics.chem-ph","categories":"physics.chem-ph|cond-mat.str-el|physics.atom-ph|physics.comp-ph"} {"id":"1408.5661v3","submitted":"2014-08-25 04:44:53","updated":"2015-04-17 06:59:26","title":"Asymptotic Accuracy of Bayesian Estimation for a Single Latent Variable","abstract":" In data science and machine learning, hierarchical parametric models, such as\nmixture models, are often used. They contain two kinds of variables: observable\nvariables, which represent the parts of the data that can be directly measured,\nand latent variables, which represent the underlying processes that generate\nthe data. Although there has been an increase in research on the estimation\naccuracy for observable variables, the theoretical analysis of estimating\nlatent variables has not been thoroughly investigated. In a previous study, we\ndetermined the accuracy of a Bayes estimation for the joint probability of the\nlatent variables in a dataset, and we proved that the Bayes method is\nasymptotically more accurate than the maximum-likelihood method. However, the\naccuracy of the Bayes estimation for a single latent variable remains unknown.\nIn the present paper, we derive the asymptotic expansions of the error\nfunctions, which are defined by the Kullback-Leibler divergence, for two types\nof single-variable estimations when the statistical regularity is satisfied.\nOur results indicate that the accuracies of the Bayes and maximum-likelihood\nmethods are asymptotically equivalent and clarify that the Bayes method is only\nadvantageous for multivariable estimations.\n","authors":"Keisuke Yamazaki","affiliations":"","link_abstract":"http://arxiv.org/abs/1408.5661v3","link_pdf":"http://arxiv.org/pdf/1408.5661v3","link_doi":"","comment":"28 pages, 3 figures","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1409.0798v1","submitted":"2014-09-02 17:16:47","updated":"2014-09-02 17:16:47","title":"DataHub: Collaborative Data Science & Dataset Version Management at\n Scale","abstract":" Relational databases have limited support for data collaboration, where teams\ncollaboratively curate and analyze large datasets. Inspired by software version\ncontrol systems like git, we propose (a) a dataset version control system,\ngiving users the ability to create, branch, merge, difference and search large,\ndivergent collections of datasets, and (b) a platform, DataHub, that gives\nusers the ability to perform collaborative data analysis building on this\nversion control system. 
We outline the challenges in providing dataset version\ncontrol at scale.\n","authors":"Anant Bhardwaj|Souvik Bhattacherjee|Amit Chavan|Amol Deshpande|Aaron J. Elmore|Samuel Madden|Aditya G. Parameswaran","affiliations":"","link_abstract":"http://arxiv.org/abs/1409.0798v1","link_pdf":"http://arxiv.org/pdf/1409.0798v1","link_doi":"","comment":"7 pages","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1409.2558v3","submitted":"2014-09-09 00:41:56","updated":"2016-04-07 02:14:59","title":"Penalty methods for a class of non-Lipschitz optimization problems","abstract":" We consider a class of constrained optimization problems with a possibly\nnonconvex non-Lipschitz objective and a convex feasible set being the\nintersection of a polyhedron and a possibly degenerate ellipsoid. Such problems\nhave a wide range of applications in data science, where the objective is used\nfor inducing sparsity in the solutions while the constraint set models the\nnoise tolerance and incorporates other prior information for data fitting. To\nsolve this class of constrained optimization problems, a common approach is the\npenalty method. However, there is little theory on exact penalization for\nproblems with nonconvex and non-Lipschitz objective functions. In this paper,\nwe study the existence of exact penalty parameters regarding local minimizers,\nstationary points and $\\epsilon$-minimizers under suitable assumptions.\nMoreover, we discuss a penalty method whose subproblems are solved via a\nnonmonotone proximal gradient method with a suitable update scheme for the\npenalty parameters, and prove the convergence of the algorithm to a KKT point\nof the constrained problem. Preliminary numerical results demonstrate the\nefficiency of the penalty method for finding sparse solutions of\nunderdetermined linear systems.\n","authors":"Xiaojun Chen|Zhaosong Lu|Ting Kei Pong","affiliations":"","link_abstract":"http://arxiv.org/abs/1409.2558v3","link_pdf":"http://arxiv.org/pdf/1409.2558v3","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.OC","categories":"math.OC|stat.ML"} {"id":"1409.4296v1","submitted":"2014-09-15 15:35:56","updated":"2014-09-15 15:35:56","title":"Commons at the Intersection of Peer Production, Citizen Science, and Big\n Data: Galaxy Zoo","abstract":" The knowledge commons research framework is applied to a case of commons\ngovernance grounded in research in modern astronomy. The case, Galaxy Zoo, is a\nleading example of at least three different contemporary phenomena. In the\nfirst place Galaxy Zoo is a global citizen science project, in which volunteer\nnon-scientists have been recruited to participate in large-scale data analysis\nvia the Internet. In the second place Galaxy Zoo is a highly successful example\nof peer production, sometimes known colloquially as crowdsourcing, by which\ndata are gathered, supplied, and/or analyzed by very large numbers of anonymous\nand pseudonymous contributors to an enterprise that is centrally coordinated or\nmanaged. In the third place Galaxy Zoo is a highly visible example of\ndata-intensive science, sometimes referred to as e-science or Big Data science,\nby which scientific researchers develop methods to grapple with the massive\nvolumes of digital data now available to them via modern sensing and imaging\ntechnologies. This chapter synthesizes these three perspectives on Galaxy Zoo\nvia the knowledge commons framework.\n","authors":"Michael J. 
Madison","affiliations":"","link_abstract":"http://arxiv.org/abs/1409.4296v1","link_pdf":"http://arxiv.org/pdf/1409.4296v1","link_doi":"","comment":"47 pages. Published in Governing Knowledge Commons, Brett M.\n Frischmann, Michael J. Madison and Katherine J. Strandburg, eds., Oxford\n University Press, 2014","journal_ref":"","doi":"","primary_category":"astro-ph.GA","categories":"astro-ph.GA|astro-ph.IM"} {"id":"1410.3127v3","submitted":"2014-10-12 18:17:04","updated":"2015-08-04 20:16:03","title":"Data Science in Statistics Curricula: Preparing Students to \"Think with\n Data\"","abstract":" A growing number of students are completing undergraduate degrees in\nstatistics and entering the workforce as data analysts. In these positions,\nthey are expected to understand how to utilize databases and other data\nwarehouses, scrape data from Internet sources, program solutions to complex\nproblems in multiple languages, and think algorithmically as well as\nstatistically. These data science topics have not traditionally been a major\ncomponent of undergraduate programs in statistics. Consequently, a curricular\nshift is needed to address additional learning outcomes. The goal of this paper\nis to motivate the importance of data science proficiency and to provide\nexamples and resources for instructors to implement data science in their own\nstatistics curricula. We provide case studies from seven institutions. These\nvaried approaches to teaching data science demonstrate curricular innovations\nto address new needs. Also included here are examples of assignments designed\nfor courses that foster engagement of undergraduates with data and data\nscience.\n","authors":"Johanna Hardin|Roger Hoerl|Nicholas J. Horton|Deborah Nolan","affiliations":"","link_abstract":"http://arxiv.org/abs/1410.3127v3","link_pdf":"http://arxiv.org/pdf/1410.3127v3","link_doi":"http://dx.doi.org/10.1080/00031305.2015.1077729","comment":"","journal_ref":"","doi":"10.1080/00031305.2015.1077729","primary_category":"stat.OT","categories":"stat.OT"} {"id":"1410.5631v2","submitted":"2014-10-21 12:12:43","updated":"2014-11-01 15:55:08","title":"Data Driven Discovery in Astrophysics","abstract":" We review some aspects of the current state of data-intensive astronomy, its\nmethods, and some outstanding data analysis challenges. Astronomy is at the\nforefront of \"big data\" science, with exponentially growing data volumes and\ndata rates, and an ever-increasing complexity, now entering the Petascale\nregime. Telescopes and observatories from both ground and space, covering a\nfull range of wavelengths, feed the data via processing pipelines into\ndedicated archives, where they can be accessed for scientific analysis. Most of\nthe large archives are connected through the Virtual Observatory framework,\nthat provides interoperability standards and services, and effectively\nconstitutes a global data grid of astronomy. Making discoveries in this\noverabundance of data requires applications of novel, machine learning tools.\nWe describe some of the recent examples of such applications.\n","authors":"G. Longo|M. Brescia|S. G. Djorgovski|S. Cavuoti|C. 
Donalek","affiliations":"","link_abstract":"http://arxiv.org/abs/1410.5631v2","link_pdf":"http://arxiv.org/pdf/1410.5631v2","link_doi":"","comment":"Keynote talk in the proceedings of ESA-ESRIN Conference: Big Data\n from Space 2014, Frascati, Italy, November 12-14, 2014, 8 pages, 2 figures","journal_ref":"","doi":"","primary_category":"astro-ph.IM","categories":"astro-ph.IM"} {"id":"1410.6121v2","submitted":"2014-10-22 17:55:53","updated":"2014-12-09 18:21:56","title":"The Nonequilibrium Many-Body Problem as a paradigm for extreme data\n science","abstract":" Generating big data pervades much of physics. But some problems, which we\ncall extreme data problems, are too large to be treated within big data\nscience. The nonequilibrium quantum many-body problem on a lattice is just such\na problem, where the Hilbert space grows exponentially with system size and\nrapidly becomes too large to fit on any computer (and can be effectively\nthought of as an infinite-sized data set). Nevertheless, much progress has been\nmade with computational methods on this problem, which serve as a paradigm for\nhow one can approach and attack extreme data problems. In addition, viewing\nthese physics problems from a computer-science perspective leads to new\napproaches that can be tried to solve them more accurately and for longer\ntimes. We review a number of these different ideas here.\n","authors":"J. K. Freericks|B. K. Nikolic|O. Frieder","affiliations":"","link_abstract":"http://arxiv.org/abs/1410.6121v2","link_pdf":"http://arxiv.org/pdf/1410.6121v2","link_doi":"http://dx.doi.org/10.1142/S0217979214300217","comment":"33 pages, 7 figures, invited review for Int. J. Mod. Phys. B;\n published version with additional references","journal_ref":"Int J. Mod. Phys. B 28, 1430021 (2014)","doi":"10.1142/S0217979214300217","primary_category":"cond-mat.str-el","categories":"cond-mat.str-el|cond-mat.stat-mech|cs.CC|cs.CE|math-ph|math.MP"} {"id":"1410.6646v1","submitted":"2014-10-24 11:03:07","updated":"2014-10-24 11:03:07","title":"Stock fluctuations are correlated and amplified across networks of\n interlocking directorates","abstract":" Traded corporations are required by law to have a majority of outside\ndirectors on their board. This requirement allows the existence of directors\nwho sit on the board of two or more corporations at the same time, generating\nwhat is commonly known as interlocking directorates. While research has shown\nthat networks of interlocking directorates facilitate the transmission of\ninformation between corporations, little is known about the extent to which\nsuch interlocking networks can explain the fluctuations of stock price returns.\nYet, this is a special concern since the risk of amplifying stock fluctuations\nis latent. To answer this question, here we analyze the board composition,\ntraders' perception, and stock performance of more than 1500 US traded\ncorporations from 2007-2011. First, we find that the fewer degrees of\nseparation between two corporations in the interlocking network, the stronger\nthe temporal correlation between their stock price returns. Second, we find\nthat the centrality of traded corporations in the interlocking network\ncorrelates with the frequency at which financial traders talk about such\ncorporations, and this frequency is in turn proportional to the corresponding\ntraded volume. Third, we show that the centrality of corporations was\nnegatively associated with their stock performance in 2008, the year of the big\nfinancial crash. 
These results suggest that the strategic decisions made by\ninterlocking directorates are strongly followed by stock analysts and have the\npotential to correlate and amplify the movement of stock prices during\nfinancial crashes. These results may have relevant implications for scholars,\ninvestors, and regulators.\n","authors":"Serguei Saavedra|Luis J. Gilarranz|Rudolf P. Rohr|Michael Schnabel|Brian Uzzi|Jordi Bascompte","affiliations":"","link_abstract":"http://arxiv.org/abs/1410.6646v1","link_pdf":"http://arxiv.org/pdf/1410.6646v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-014-0030-0","comment":"","journal_ref":"EPJ Data Science 3: 30 (2014)","doi":"10.1140/epjds/s13688-014-0030-0","primary_category":"q-fin.GN","categories":"q-fin.GN|physics.soc-ph"} {"id":"1411.5014v1","submitted":"2014-11-18 14:19:28","updated":"2014-11-18 14:19:28","title":"Music Data Analysis: A State-of-the-art Survey","abstract":" Music accounts for a significant chunk of interest among various online\nactivities. This is reflected by wide array of alternatives offered in music\nrelated web/mobile apps, information portals, featuring millions of artists,\nsongs and events attracting user activity at similar scale. Availability of\nlarge scale structured and unstructured data has attracted similar level of\nattention by data science community. This paper attempts to offer current\nstate-of-the-art in music related analysis. Various approaches involving\nmachine learning, information theory, social network analysis, semantic web and\nlinked open data are represented in the form of taxonomy along with data\nsources and use cases addressed by the research community.\n","authors":"Shubhanshu Gupta","affiliations":"","link_abstract":"http://arxiv.org/abs/1411.5014v1","link_pdf":"http://arxiv.org/pdf/1411.5014v1","link_doi":"","comment":"10 pages, 6 figures","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB|cs.LG|cs.SD|97M80|H.5.5; J.5"} {"id":"1411.5945v1","submitted":"2014-11-21 16:35:59","updated":"2014-11-21 16:35:59","title":"Inequality and cumulative advantage in science careers: a case study of\n high-impact journals","abstract":" Analyzing a large data set of publications drawn from the most competitive\njournals in the natural and social sciences we show that research careers\nexhibit the broad distributions of individual achievement characteristic of\nsystems in which cumulative advantage plays a key role. While most researchers\nare personally aware of the competition implicit in the publication process,\nlittle is known about the levels of inequality at the level of individual\nresearchers. We analyzed both productivity and impact measures for a large set\nof researchers publishing in high-impact journals. For each researcher cohort\nwe calculated Gini inequality coefficients, with average Gini values around\n0.48 for total publications and 0.73 for total citations. For perspective,\nthese observed values are well in excess of the inequality levels observed for\npersonal income in developing countries. Investigating possible sources of this\ninequality, we identify two potential mechanisms that act at the level of the\nindividual that may play defining roles in the emergence of the broad\nproductivity and impact distributions found in science. First, we show that the\naverage time interval between a researcher's successive publications in top\njournals decreases with each subsequent publication. 
Second, after controlling\nfor the time dependent features of citation distributions, we compare the\ncitation impact of subsequent publications within a researcher's publication\nrecord. We find that as researchers continue to publish in top journals, there\nis more likely to be a decreasing trend in the relative citation impact with\neach subsequent publication. This pattern highlights the difficulty of\nrepeatedly publishing high-impact research and the intriguing possibility that\nconfirmation bias plays a role in the evaluation of scientific careers.\n","authors":"Alexander M. Petersen|Orion Penner","affiliations":"","link_abstract":"http://arxiv.org/abs/1411.5945v1","link_pdf":"http://arxiv.org/pdf/1411.5945v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-014-0024-y","comment":"2-page summary of the long published version which is available open\n access here at http://www.epjdatascience.com/content/3/1/24","journal_ref":"EPJ Data Science 3, 24 (2014)","doi":"10.1140/epjds/s13688-014-0024-y","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.DL"} {"id":"1411.7753v1","submitted":"2014-11-28 05:08:42","updated":"2014-11-28 05:08:42","title":"On Low Discrepancy Samplings in Product Spaces of Motion Groups","abstract":" Deterministically generating near-uniform point samplings of the motion\ngroups like SO(3), SE(3) and their n-wise products SO(3)^n, SE(3)^n is\nfundamental to numerous applications in computational and data sciences. The\nnatural measure of sampling quality is discrepancy. In this work, our main goal\nis construct low discrepancy deterministic samplings in product spaces of the\nmotion groups. To this end, we develop a novel strategy (using a two-step\ndiscrepancy construction) that leads to an almost exponential improvement in\nsize (from the trivial direct product). To the best of our knowledge, this is\nthe first nontrivial construction for SO(3)^n, SE(3)^n and the hypertorus T^n.\n We also construct new low discrepancy samplings of S^2 and SO(3). The central\ncomponent in our construction for SO(3) is an explicit construction of N points\nin S^2 with discrepancy \\tilde{\\O}(1/\\sqrt{N}) with respect to convex sets,\nmatching the bound achieved for the special case of spherical caps in\n\\cite{ABD_12}. We also generalize the discrepancy of Cartesian product sets\n\\cite{Chazelle04thediscrepancy} to the discrepancy of local Cartesian product\nsets.\n The tools we develop should be useful in generating low discrepancy samplings\nof other complicated geometric spaces.\n","authors":"Chandrajit Bajaj|Abhishek Bhowmick|Eshan Chattopadhyay|David Zuckerman","affiliations":"","link_abstract":"http://arxiv.org/abs/1411.7753v1","link_pdf":"http://arxiv.org/pdf/1411.7753v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CG","categories":"cs.CG|52CXX (Primary) 68Q25, 68W01 (Secondary)|I.3.5; F.2.2"} {"id":"1501.05039v1","submitted":"2015-01-21 02:41:55","updated":"2015-01-21 02:41:55","title":"Defining Data Science","abstract":" Data science is gaining more and more and widespread attention, but no\nconsensus viewpoint on what data science is has emerged. As a new science, its\nobjects of study and scientific issues should not be covered by established\nsciences. Data in cyberspace have formed what we call datanature. 
In the\npresent paper, data science is defined as the science of exploring datanature.\n","authors":"Yangyong Zhu|Yun Xiong","affiliations":"","link_abstract":"http://arxiv.org/abs/1501.05039v1","link_pdf":"http://arxiv.org/pdf/1501.05039v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB|cs.CY"} {"id":"1502.00318v1","submitted":"2015-02-01 21:43:51","updated":"2015-02-01 21:43:51","title":"Setting the stage for data science: integration of data management\n skills in introductory and second courses in statistics","abstract":" Many have argued that statistics students need additional facility to express\nstatistical computations. By introducing students to commonplace tools for data\nmanagement, visualization, and reproducible analysis in data science and\napplying these to real-world scenarios, we prepare them to think statistically.\nIn an era of increasingly big data, it is imperative that students develop\ndata-related capacities, beginning with the introductory course. We believe\nthat the integration of these precursors to data science into our\ncurricula-early and often-will help statisticians be part of the dialogue\nregarding \"Big Data\" and \"Big Questions\".\n","authors":"Nicholas J. Horton|Benjamin S. Baumer|Hadley Wickham","affiliations":"","link_abstract":"http://arxiv.org/abs/1502.00318v1","link_pdf":"http://arxiv.org/pdf/1502.00318v1","link_doi":"http://dx.doi.org/10.1080/09332480.2015.1042739","comment":"","journal_ref":"","doi":"10.1080/09332480.2015.1042739","primary_category":"stat.CO","categories":"stat.CO|cs.CY|stat.OT|62-01"} {"id":"1502.07994v1","submitted":"2015-02-26 09:02:26","updated":"2015-02-26 09:02:26","title":"An Effective Private Data storage and Retrieval System using Secret\n sharing scheme based on Secure Multi-party Computation","abstract":" Privacy of the outsourced data is one of the major challenges. Insecurity of\nthe network environment and untrustworthiness of the service providers are\nobstacles to providing the database as a service. Collection and storage of\npersonally identifiable information is a major privacy concern. On-line public\ndatabases and resources pose a significant risk to user privacy, since a\nmalicious database owner may monitor user queries and infer useful information\nabout the customer. The challenge in data privacy is to share data with a\nthird party and at the same time secure the valuable information from\nunauthorized access and use by that third party. A Private Information Retrieval (PIR)\nscheme allows a user to query a database while hiding the identity of the data\nretrieved. The naive solution for confidentiality is to encrypt data before\noutsourcing. Query execution, key management and statistical inference are major\nchallenges in this case. The proposed system suggests a mechanism for secure\nstorage and retrieval of private data using the secret sharing technique. The\nidea is to develop a mechanism to store private information with a highly\navailable storage provider which could be accessed from anywhere using queries\nwhile hiding the actual data values from the storage provider. The private\ninformation retrieval system is implemented using the Secure Multi-party\nComputation (SMC) technique, which is based on secret sharing. Multi-party\nComputation enables parties to compute some joint function over their private\ninputs. The query results are obtained by performing a secure computation on the\nshares owned by the different servers.\n","authors":"Divya G. Nair|V. P. Binu|G. 
Santhosh Kumar","affiliations":"","link_abstract":"http://arxiv.org/abs/1502.07994v1","link_pdf":"http://arxiv.org/pdf/1502.07994v1","link_doi":"http://dx.doi.org/10.1109/ICDSE.2014.6974639","comment":"Data Science & Engineering (ICDSE), 2014 International Conference,\n CUSAT","journal_ref":"","doi":"10.1109/ICDSE.2014.6974639","primary_category":"cs.CR","categories":"cs.CR"} {"id":"1503.00244v1","submitted":"2015-03-01 09:41:11","updated":"2015-03-01 09:41:11","title":"23-bit Metaknowledge Template Towards Big Data Knowledge Discovery and\n Management","abstract":" The global influence of Big Data is not only growing but seemingly endless.\nThe trend is leaning towards knowledge that is attained easily and quickly from\nmassive pools of Big Data. Today we are living in the technological world that\nDr. Usama Fayyad and his distinguished research fellows discussed in the\nintroductory explanations of Knowledge Discovery in Databases (KDD) predicted\nnearly two decades ago. Indeed, they were precise in their outlook on Big Data\nanalytics. In fact, the continued improvement of the interoperability of\nmachine learning, statistics, database building and querying fused to create\nthis increasingly popular science- Data Mining and Knowledge Discovery. The\nnext generation computational theories are geared towards helping to extract\ninsightful knowledge from even larger volumes of data at higher rates of speed.\nAs the trend increases in popularity, the need for a highly adaptive solution\nfor knowledge discovery will be necessary. In this research paper, we are\nintroducing the investigation and development of 23 bit-questions for a\nMetaknowledge template for Big Data Processing and clustering purposes. This\nresearch aims to demonstrate the construction of this methodology and proves\nthe validity and the beneficial utilization that brings Knowledge Discovery\nfrom Big Data.\n","authors":"Nima Bari|Roman Vichr|Kamran Kowsari|Simon Y. Berkovich","affiliations":"","link_abstract":"http://arxiv.org/abs/1503.00244v1","link_pdf":"http://arxiv.org/pdf/1503.00244v1","link_doi":"http://dx.doi.org/10.1109/DSAA.2014.7058121","comment":"IEEE Data Science and Advanced Analytics (DSAA'2014)","journal_ref":"","doi":"10.1109/DSAA.2014.7058121","primary_category":"cs.DB","categories":"cs.DB|cs.AI|cs.IR|cs.LG"} {"id":"1503.00635v2","submitted":"2015-03-02 17:44:38","updated":"2015-04-23 23:01:04","title":"BayesSummaryStatLM: An R package for Bayesian Linear Models for Big Data\n and Data Science","abstract":" Recent developments in data science and big data research have produced an\nabundance of large data sets that are too big to be analyzed in their entirety,\ndue to limits on either computer memory or storage capacity. Here, we introduce\nour R package 'BayesSummaryStatLM' for Bayesian linear regression models with\nMarkov chain Monte Carlo implementation that overcomes these limitations. Our\nBayesian models use only summary statistics of data as input; these summary\nstatistics can be calculated from subsets of big data and combined over\nsubsets. Thus, complete data sets do not need to be read into memory in full,\nwhich removes any physical memory limitations of a user. Our package\nincorporates the R package 'ff' and its functions for reading in big data sets\nin chunks while simultaneously calculating summary statistics. 
We describe our\nBayesian linear regression models, including several choices of prior\ndistributions for unknown model parameters, and illustrate capabilities and\nfeatures of our R package using both simulated and real data sets.\n","authors":"Alexey Miroshnikov|Evgeny Savel'ev|Erin M. Conlon","affiliations":"","link_abstract":"http://arxiv.org/abs/1503.00635v2","link_pdf":"http://arxiv.org/pdf/1503.00635v2","link_doi":"","comment":"Updated URL in reference [12]; added to description of zero.intercept\n on p. 11; added minor clarifications throughout; results unchanged","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP|stat.CO"} {"id":"1503.01976v1","submitted":"2015-03-06 14:40:56","updated":"2015-03-06 14:40:56","title":"Comment on \"Stellar activity masquerading as planets in the habitable\n zone of the M dwarf Gliese 581\"","abstract":" Robertson et al.(Reports, July 25 2014, p440-444)(1) claimed that\nactivity-induced variability is responsible for the Doppler signal of the\nproposed planet candidate GJ 581d. We point out that their analysis using\nperiodograms of residual data is incorrect, further promoting inadequate tools.\nSince the claim challenges the viability of the method to detect exo-Earths, we\nurge for more appropriate analyses (see appendix).\n","authors":"Guillem Anglada-Escudé|Mikko Tuomi","affiliations":"","link_abstract":"http://arxiv.org/abs/1503.01976v1","link_pdf":"http://arxiv.org/pdf/1503.01976v1","link_doi":"http://dx.doi.org/10.1126/science.1260796","comment":"15 pages, 4 figures. Includes appendix with full re-analysis of the\n data","journal_ref":"Science, 6 March 2015. Vol. 347 no. 6226 p. 1080","doi":"10.1126/science.1260796","primary_category":"astro-ph.EP","categories":"astro-ph.EP|astro-ph.IM"} {"id":"1503.02188v3","submitted":"2015-03-07 16:46:59","updated":"2015-04-28 14:51:44","title":"Challenges and opportunities for statistics and statistical education:\n looking back, looking forward","abstract":" The 175th anniversary of the ASA provides an opportunity to look back into\nthe past and peer into the future. What led our forebears to found the\nassociation? What commonalities do we still see? What insights might we glean\nfrom their experiences and observations? I will use the anniversary as a chance\nto reflect on where we are now and where we are headed in terms of statistical\neducation amidst the growth of data science. Statistics is the science of\nlearning from data. By fostering more multivariable thinking, building\ndata-related skills, and developing simulation-based problem solving, we can\nhelp to ensure that statisticians are fully engaged in data science and the\nanalysis of the abundance of data now available to us.\n","authors":"Nicholas Jon Horton","affiliations":"","link_abstract":"http://arxiv.org/abs/1503.02188v3","link_pdf":"http://arxiv.org/pdf/1503.02188v3","link_doi":"http://dx.doi.org/10.1080/00031305.2015.1032435","comment":"In press: The American Statistician","journal_ref":"","doi":"10.1080/00031305.2015.1032435","primary_category":"stat.OT","categories":"stat.OT|stat.CO"} {"id":"1503.02484v2","submitted":"2015-03-09 14:23:42","updated":"2015-07-17 13:36:21","title":"Win-stay lose-shift strategy in formation changes in football","abstract":" Managerial decision making is likely to be a dominant determinant of\nperformance of teams in team sports. 
Here we use Japanese and German football\ndata to investigate correlates between temporal patterns of formation changes\nacross matches and match results. We found that individual teams and managers\nboth showed win-stay lose-shift behavior, a type of reinforcement learning. In\nother words, they tended to stick to the current formation after a win and\nswitch to a different formation after a loss. In addition, formation changes\ndid not statistically improve the results of succeeding matches.The results\nindicate that a swift implementation of a new formation in the win-stay\nlose-shift manner may not be a successful managerial rule of thumb.\n","authors":"Kohei Tamura|Naoki Masuda","affiliations":"","link_abstract":"http://arxiv.org/abs/1503.02484v2","link_pdf":"http://arxiv.org/pdf/1503.02484v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-015-0045-1","comment":"7 figures, 11 tables","journal_ref":"EPJ Data Science, 4, 9 (2015)","doi":"10.1140/epjds/s13688-015-0045-1","primary_category":"physics.soc-ph","categories":"physics.soc-ph"} {"id":"1503.05502v3","submitted":"2015-03-18 17:30:21","updated":"2015-05-30 11:37:20","title":"Urban Magnetism Through The Lens of Geo-tagged Photography","abstract":" There is an increasing trend of people leaving digital traces through social\nmedia. This reality opens new horizons for urban studies. With this kind of\ndata, researchers and urban planners can detect many aspects of how people live\nin cities and can also suggest how to transform cities into more efficient and\nsmarter places to live in. In particular, their digital trails can be used to\ninvestigate tastes of individuals, and what attracts them to live in a\nparticular city or to spend their vacation there. In this paper we propose an\nunconventional way to study how people experience the city, using information\nfrom geotagged photographs that people take at different locations. We compare\nthe spatial behavior of residents and tourists in 10 most photographed cities\nall around the world. The study was conducted on both a global and local level.\nOn the global scale we analyze the 10 most photographed cities and measure how\nattractive each city is for people visiting it from other cities within the\nsame country or from abroad. For the purpose of our analysis we construct the\nusers mobility network and measure the strength of the links between each pair\nof cities as a level of attraction of people living in one city (i.e., origin)\nto the other city (i.e., destination). On the local level we study the spatial\ndistribution of user activity and identify the photographed hotspots inside\neach city. The proposed methodology and the results of our study are a low cost\nmean to characterize a touristic activity within a certain location and can\nhelp in urban organization to strengthen their touristic potential.\n","authors":"Silvia Paldino|Iva Bojic|Stanislav Sobolevsky|Carlo Ratti|Marta C. 
Gonzalez","affiliations":"","link_abstract":"http://arxiv.org/abs/1503.05502v3","link_pdf":"http://arxiv.org/pdf/1503.05502v3","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-015-0043-3","comment":"17 pages, 10 figures, 6 tables","journal_ref":"EPJ Data Science 2015, 4:5","doi":"10.1140/epjds/s13688-015-0043-3","primary_category":"cs.SI","categories":"cs.SI|physics.soc-ph|91D30"} {"id":"1503.05570v1","submitted":"2015-03-18 20:05:24","updated":"2015-03-18 20:05:24","title":"A Data Science Course for Undergraduates: Thinking with Data","abstract":" Data science is an emerging interdisciplinary field that combines elements of\nmathematics, statistics, computer science, and knowledge in a particular\napplication domain for the purpose of extracting meaningful information from\nthe increasingly sophisticated array of data available in many settings. These\ndata tend to be non-traditional, in the sense that they are often live, large,\ncomplex, and/or messy. A first course in statistics at the undergraduate level\ntypically introduces students with a variety of techniques to analyze small,\nneat, and clean data sets. However, whether they pursue more formal training in\nstatistics or not, many of these students will end up working with data that is\nconsiderably more complex, and will need facility with statistical computing\ntechniques. More importantly, these students require a framework for thinking\nstructurally about data. We describe an undergraduate course in a liberal arts\nenvironment that provides students with the tools necessary to apply data\nscience. The course emphasizes modern, practical, and useful skills that cover\nthe full data analysis spectrum, from asking an interesting question to\nacquiring, managing, manipulating, processing, querying, analyzing, and\nvisualizing data, as well communicating findings in written, graphical, and\noral forms.\n","authors":"Ben Baumer","affiliations":"","link_abstract":"http://arxiv.org/abs/1503.05570v1","link_pdf":"http://arxiv.org/pdf/1503.05570v1","link_doi":"","comment":"21 pages total including supplementary materials","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT|cs.CY|stat.CO|62-01"} {"id":"1503.05638v2","submitted":"2015-03-19 02:54:21","updated":"2015-09-21 04:31:50","title":"Entropy-scaling search of massive biological data","abstract":" Many datasets exhibit a well-defined structure that can be exploited to\ndesign faster search tools, but it is not always clear when such acceleration\nis possible. Here, we introduce a framework for similarity search based on\ncharacterizing a dataset's entropy and fractal dimension. We prove that\nsearching scales in time with metric entropy (number of covering hyperspheres),\nif the fractal dimension of the dataset is low, and scales in space with the\nsum of metric entropy and information-theoretic entropy (randomness of the\ndata). Using these ideas, we present accelerated versions of standard tools,\nwith no loss in specificity and little loss in sensitivity, for use in three\ndomains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics\n(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search\n(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve\n\"compressive omics,\" and the general theory can be readily applied to data\nscience problems outside of biology.\n","authors":"Y. William Yu|Noah M. 
Daniels|David Christian Danko|Bonnie Berger","affiliations":"","link_abstract":"http://arxiv.org/abs/1503.05638v2","link_pdf":"http://arxiv.org/pdf/1503.05638v2","link_doi":"http://dx.doi.org/10.1016/j.cels.2015.08.004","comment":"Including supplement: 41 pages, 6 figures, 4 tables, 1 box","journal_ref":"Cell Systems, Volume 1, Issue 2, 130-140, 2015","doi":"10.1016/j.cels.2015.08.004","primary_category":"cs.DS","categories":"cs.DS|q-bio.GN"} {"id":"1503.06201v1","submitted":"2015-03-20 19:31:09","updated":"2015-03-20 19:31:09","title":"Data Science as a New Frontier for Design","abstract":" The purpose of this paper is to contribute to the challenge of transferring\nknow-how, theories and methods from design research to the design processes in\ninformation science and technologies. More specifically, we shall consider a\ndomain, namely data-science, that is becoming rapidly a globally invested\nresearch and development axis with strong imperatives for innovation given the\ndata deluge we are currently facing. We argue that, in order to rise to the\ndata-related challenges that the society is facing, data-science initiatives\nshould ensure a renewal of traditional research methodologies that are still\nlargely based on trial-error processes depending on the talent and insights of\na single (or a restricted group of) researchers. It is our claim that design\ntheories and methods can provide, at least to some extent, the much-needed\nframework. We will use a worldwide data-science challenge organized to study a\ntechnical problem in physics, namely the detection of Higgs boson, as a use\ncase to demonstrate some of the ways in which design theory and methods can\nhelp in analyzing and shaping the innovation dynamics in such projects.\n","authors":"Akin Osman|Kazakçi Mines","affiliations":"CGS|CGS","link_abstract":"http://arxiv.org/abs/1503.06201v1","link_pdf":"http://arxiv.org/pdf/1503.06201v1","link_doi":"","comment":"International Conference on Engineering Design, Jul 2015, Milan,\n Italy","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI|stat.OT"} {"id":"1504.00238v2","submitted":"2015-04-01 14:15:22","updated":"2020-07-22 17:20:46","title":"Sample Size Calculations for Micro-randomized Trials in mHealth","abstract":" The use and development of mobile interventions are experiencing rapid\ngrowth. In \"just-in-time\" mobile interventions, treatments are provided via a\nmobile device and they are intended to help an individual make healthy\ndecisions \"in the moment,\" and thus have a proximal, near future impact.\nCurrently the development of mobile interventions is proceeding at a much\nfaster pace than that of associated data science methods. A first step toward\ndeveloping data-based methods is to provide an experimental design for testing\nthe proximal effects of these just-in-time treatments. In this paper, we\npropose a \"micro-randomized\" trial design for this purpose. In a\nmicro-randomized trial, treatments are sequentially randomized throughout the\nconduct of the study, with the result that each participant may be randomized\nat the 100s or 1000s of occasions at which a treatment might be provided.\nFurther, we develop a test statistic for assessing the proximal effect of a\ntreatment as well as an associated sample size calculator. We conduct\nsimulation evaluations of the sample size calculator in various settings. Rules\nof thumb that might be used in designing a micro-randomized trial are\ndiscussed. 
This work is motivated by our collaboration on the HeartSteps mobile\napplication designed to increase physical activity.\n","authors":"Peng Liao|Predrag Klasnja|Ambuj Tewari|Susan A. Murphy","affiliations":"","link_abstract":"http://arxiv.org/abs/1504.00238v2","link_pdf":"http://arxiv.org/pdf/1504.00238v2","link_doi":"http://dx.doi.org/10.1002/sim.6847","comment":"29 pages, 5 figures, 18 tables","journal_ref":"Statistics in medicine 35, no. 12 (2016): 1944-1971","doi":"10.1002/sim.6847","primary_category":"stat.ME","categories":"stat.ME"} {"id":"1504.01362v7","submitted":"2015-04-06 19:18:49","updated":"2016-05-29 14:07:02","title":"A New Approach to Building the Interindustry Input--Output Table","abstract":" We present a new approach to estimating the interdependence of industries in\nan economy by applying data science solutions. By exploiting interfirm\nbuyer--seller network data, we show that the problem of estimating the\ninterdependence of industries is similar to the problem of uncovering the\nlatent block structure in network science literature. To estimate the\nunderlying structure with greater accuracy, we propose an extension of the\nsparse block model that incorporates node textual information and an unbounded\nnumber of industries and interactions among them. The latter task is\naccomplished by extending the well-known Chinese restaurant process to two\ndimensions. Inference is based on collapsed Gibbs sampling, and the model is\nevaluated on both synthetic and real-world datasets. We show that the proposed\nmodel improves in predictive accuracy and successfully provides a satisfactory\nsolution to the motivated problem. We also discuss issues that affect the\nfuture performance of this approach.\n","authors":"Ryohei Hisano","affiliations":"","link_abstract":"http://arxiv.org/abs/1504.01362v7","link_pdf":"http://arxiv.org/pdf/1504.01362v7","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML"} {"id":"1504.02878v1","submitted":"2015-04-11 14:14:08","updated":"2015-04-11 14:14:08","title":"Data Science and Ebola","abstract":" Data Science---Today, everybody and everything produces data. People produce\nlarge amounts of data in social networks and in commercial transactions.\nMedical, corporate, and government databases continue to grow. Sensors continue\nto get cheaper and are increasingly connected, creating an Internet of Things,\nand generating even more data. In every discipline, large, diverse, and rich\ndata sets are emerging, from astrophysics, to the life sciences, to the\nbehavioral sciences, to finance and commerce, to the humanities and to the\narts. In every discipline people want to organize, analyze, optimize and\nunderstand their data to answer questions and to deepen insights. The science\nthat is transforming this ocean of data into a sea of knowledge is called data\nscience. 
This lecture will discuss how data science has changed the way in\nwhich one of the most visible challenges to public health is handled, the 2014\nEbola outbreak in West Africa.\n","authors":"Aske Plaat","affiliations":"","link_abstract":"http://arxiv.org/abs/1504.02878v1","link_pdf":"http://arxiv.org/pdf/1504.02878v1","link_doi":"","comment":"Inaugural lecture Leiden University","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI|cs.CY"} {"id":"1505.01818v4","submitted":"2015-05-05 21:42:46","updated":"2016-01-22 23:14:35","title":"Wikipedia traffic data and electoral prediction: towards theoretically\n informed models","abstract":" The aim of this article is to explore the potential use of Wikipedia page\nview data for predicting electoral results. Responding to previous critiques of\nwork using socially generated data to predict elections, which have argued that\nthese predictions take place without any understanding of the mechanism which\nenables them, we first develop a theoretical model which highlights why people\nmight seek information online at election time, and how this activity might\nrelate to overall electoral outcomes, focussing especially on how different\ntypes of parties such as new and established parties might generate different\ninformation seeking patterns. We test this model on a novel dataset drawn from\na variety of countries in the 2009 and 2014 European Parliament elections. We\nshow that while Wikipedia offers little insight into absolute vote outcomes, it\noffers good information about changes in both overall turnout at elections\nand in vote share for particular parties. These results are used to enhance\nexisting theories about the drivers of aggregate patterns in online information\nseeking.\n","authors":"Taha Yasseri|Jonathan Bright","affiliations":"","link_abstract":"http://arxiv.org/abs/1505.01818v4","link_pdf":"http://arxiv.org/pdf/1505.01818v4","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-016-0083-3","comment":"submitted to EPJ Data Science. Additional File 1 available at\n https://drive.google.com/open?id=0BxaGC-YCTO6SWkJhRXlrMVRYVlE","journal_ref":"EPJ Data Science, 5: 22 (2016)","doi":"10.1140/epjds/s13688-016-0083-3","primary_category":"cs.SI","categories":"cs.SI|physics.soc-ph"} {"id":"1505.02818v1","submitted":"2015-05-11 22:02:49","updated":"2015-05-11 22:02:49","title":"Investigating Causality in Human Behavior from Smartphone Sensor Data: A\n Quasi-Experimental Approach","abstract":" Smartphones have become an indispensable part of our daily life. Their\nimproved sensing and computing capabilities bring new opportunities for human\nbehavior monitoring and analysis. Most work so far has been focused on\ndetecting correlation rather than causation among features extracted from\nsmartphone data. However, pure correlation analysis does not offer sufficient\nunderstanding of human behavior. Moreover, causation analysis could allow\nscientists to identify factors that have a causal effect on health and\nwell-being issues, such as obesity, stress, depression and so on and suggest\nactions to deal with them. Finally, detecting causal relationships in this kind\nof observational data is challenging since, in general, subjects cannot be\nrandomly exposed to an event.\n In this article, we discuss the design, implementation and evaluation of a\ngeneric quasi-experimental framework for conducting causation studies on human\nbehavior from smartphone data. 
We demonstrate the effectiveness of our approach\nby investigating the causal impact of several factors such as exercise, social\ninteractions and work on stress level. Our results indicate that exercising and\nspending time outside home and working environment have a positive effect on\nparticipants stress level while reduced working hours only slightly impact\nstress.\n","authors":"Fani Tsapeli|Mirco Musolesi","affiliations":"","link_abstract":"http://arxiv.org/abs/1505.02818v1","link_pdf":"http://arxiv.org/pdf/1505.02818v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-015-0061-1","comment":"","journal_ref":"EPJ Data Science, 2015","doi":"10.1140/epjds/s13688-015-0061-1","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1505.03776v1","submitted":"2015-05-14 16:08:11","updated":"2015-05-14 16:08:11","title":"Sentiment cascades in the 15M movement","abstract":" Recent grassroots movements have suggested that online social networks might\nplay a key role in their organization, as adherents have a fast, many-to-many,\ncommunication channel to help coordinate their mobilization. The structure and\ndynamics of the networks constructed from the digital traces of protesters have\nbeen analyzed to some extent recently. However, less effort has been devoted to\nthe analysis of the semantic content of messages exchanged during the protest.\nUsing the data obtained from a microblogging service during the brewing and\nactive phases of the 15M movement in Spain, we perform the first large scale\ntest of theories on collective emotions and social interaction in collective\nactions. Our findings show that activity and information cascades in the\nmovement are larger in the presence of negative collective emotions and when\nusers express themselves in terms related to social content. At the level of\nindividual participants, our results show that their social integration in the\nmovement, as measured through social network metrics, increases with their\nlevel of engagement and of expression of negativity. Our findings show that\nnon-rational factors play a role in the formation and activity of social\nmovements through online media, having important consequences for viral\nspreading.\n","authors":"Raquel Alvarez|David Garcia|Yamir Moreno|Frank Schweitzer","affiliations":"","link_abstract":"http://arxiv.org/abs/1505.03776v1","link_pdf":"http://arxiv.org/pdf/1505.03776v1","link_doi":"","comment":"EPJ Data Science vol 4 (2015) (forthcoming)","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI|physics.soc-ph"} {"id":"1505.04935v2","submitted":"2015-05-19 09:58:05","updated":"2015-07-06 13:45:52","title":"Towards Data-Driven Autonomics in Data Centers","abstract":" Continued reliance on human operators for managing data centers is a major\nimpediment for them from ever reaching extreme dimensions. Large computer\nsystems in general, and data centers in particular, will ultimately be managed\nusing predictive computational and executable models obtained through\ndata-science tools, and at that point, the intervention of humans will be\nlimited to setting high-level goals and policies rather than performing\nlow-level operations. Data-driven autonomics, where management and control are\nbased on holistic predictive models that are built and updated using generated\ndata, opens one possible path towards limiting the role of operators in data\ncenters. 
In this paper, we present a data-science study of a public Google\ndataset collected in a 12K-node cluster with the goal of building and\nevaluating a predictive model for node failures. We use BigQuery, the big data\nSQL platform from the Google Cloud suite, to process massive amounts of data\nand generate a rich feature set characterizing machine state over time. We\ndescribe how an ensemble classifier can be built out of many Random Forest\nclassifiers each trained on these features, to predict if machines will fail in\na future 24-hour window. Our evaluation reveals that if we limit false positive\nrates to 5%, we can achieve true positive rates between 27% and 88% with\nprecision varying between 50% and 72%. We discuss the practicality of including\nour predictive model as the central component of a data-driven autonomic\nmanager and operating it on-line with live data streams (rather than off-line\non data logs). All of the scripts used for BigQuery and classification analyses\nare publicly available from the authors' website.\n","authors":"Alina Sîrbu|Ozalp Babaoglu","affiliations":"","link_abstract":"http://arxiv.org/abs/1505.04935v2","link_pdf":"http://arxiv.org/pdf/1505.04935v2","link_doi":"","comment":"12 pages, 6 figures","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC|cs.AI|stat.ML"} {"id":"1505.05211v1","submitted":"2015-05-19 23:45:05","updated":"2015-05-19 23:45:05","title":"Principles of Dataset Versioning: Exploring the Recreation/Storage\n Tradeoff","abstract":" The relative ease of collaborative data science and analysis has led to a\nproliferation of many thousands or millions of $versions$ of the same datasets\nin many scientific and commercial domains, acquired or constructed at various\nstages of data analysis across many users, and often over long periods of time.\nManaging, storing, and recreating these dataset versions is a non-trivial task.\nThe fundamental challenge here is the $storage-recreation\\;trade-off$: the more\nstorage we use, the faster it is to recreate or retrieve versions, while the\nless storage we use, the slower it is to recreate or retrieve versions. Despite\nthe fundamental nature of this problem, there has been a surprisingly little\namount of work on it. In this paper, we study this trade-off in a principled\nmanner: we formulate six problems under various settings, trading off these\nquantities in various ways, demonstrate that most of the problems are\nintractable, and propose a suite of inexpensive heuristics drawing from\ntechniques in delay-constrained scheduling, and spanning tree literature, to\nsolve these problems. We have built a prototype version management system, that\naims to serve as a foundation to our DATAHUB system for facilitating\ncollaborative data science. 
We demonstrate, via extensive experiments, that our\nproposed heuristics provide efficient solutions in practical dataset versioning\nscenarios.\n","authors":"Souvik Bhattacherjee|Amit Chavan|Silu Huang|Amol Deshpande|Aditya Parameswaran","affiliations":"","link_abstract":"http://arxiv.org/abs/1505.05211v1","link_pdf":"http://arxiv.org/pdf/1505.05211v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1505.05425v3","submitted":"2015-05-20 15:40:02","updated":"2016-06-09 20:11:14","title":"Experiences with efficient methodologies for teaching computer\n programming to geoscientists","abstract":" Computer programming was once thought of as a skill required only by\nprofessional software developers. But today, given the ubiquitous nature of\ncomputation and data science it is quickly becoming necessary for all\nscientists and engineers to have at least a basic knowledge of how to program.\nTeaching how to program, particularly to those students with little or no\ncomputing background, is well-known to be a difficult task. However, there is\nalso a wealth of evidence-based teaching practices for teaching programming\nskills which can be applied to greatly improve learning outcomes and the\nstudent experience. Adopting these practices naturally gives rise to greater\nlearning efficiency - this is critical if programming is to be integrated into\nan already busy geoscience curriculum. This paper considers an undergraduate\ncomputer programming course, run during the last 5 years in the Department of\nEarth Science and Engineering at Imperial College London. The teaching\nmethodologies that were used each year are discussed alongside the challenges\nthat were encountered, and how the methodologies affected student performance.\nAnonymised student marks and feedback are used to highlight this, and also how\nthe adjustments made to the course eventually resulted in a highly effective\nlearning environment.\n","authors":"Christian T. Jacobs|Gerard J. Gorman|Huw E. Rees|Lorraine Craig","affiliations":"","link_abstract":"http://arxiv.org/abs/1505.05425v3","link_pdf":"http://arxiv.org/pdf/1505.05425v3","link_doi":"http://dx.doi.org/10.5408/15-101.1","comment":"Second revised version. This version was accepted for publication in\n the Journal of Geoscience Education on 9 June 2016. Contains 5 figures. The\n main change is the inclusion of a new section on outlook and future work","journal_ref":"Journal of Geoscience Education 64(3):183-198, 2016","doi":"10.5408/15-101.1","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1505.06443v1","submitted":"2015-05-24 14:58:41","updated":"2015-05-24 14:58:41","title":"Detecting bird sound in unknown acoustic background using crowdsourced\n training data","abstract":" Biodiversity monitoring using audio recordings is achievable at a truly\nglobal scale via large-scale deployment of inexpensive, unattended recording\nstations or by large-scale crowdsourcing using recording and species\nrecognition on mobile devices. The ability, however, to reliably identify\nvocalising animal species is limited by the fact that acoustic signatures of\ninterest in such recordings are typically embedded in a diverse and complex\nacoustic background. To avoid the problems associated with modelling such\nbackgrounds, we build generative models of bird sounds and use the concept of\nnovelty detection to screen recordings to detect sections of data which are\nlikely bird vocalisations. 
We present detection results against various\nacoustic environments and different signal-to-noise ratios. We discuss the\nissues related to selecting the cost function and setting detection thresholds\nin such algorithms. Our methods are designed to be scalable and automatically\napplicable to arbitrary selections of species depending on the specific\ngeographic region and time period of deployment.\n","authors":"Timos Papadopoulos|Stephen Roberts|Kathy Willis","affiliations":"","link_abstract":"http://arxiv.org/abs/1505.06443v1","link_pdf":"http://arxiv.org/pdf/1505.06443v1","link_doi":"","comment":"Submitted to 'Big Data Sciences for Bioacoustic Environmental\n Survey', 10th Advanced Multimodal Information Retrieval int'l summer school,\n Ermites 2015","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG|cs.SD"} {"id":"1506.00673v1","submitted":"2015-06-01 20:53:45","updated":"2015-06-01 20:53:45","title":"Mutual Dependence: A Novel Method for Computing Dependencies Between\n Random Variables","abstract":" In data science, it is often required to estimate dependencies between\ndifferent data sources. These dependencies are typically calculated using\nPearson's correlation, distance correlation, and/or mutual information.\nHowever, none of these measures satisfy all the Granger's axioms for an \"ideal\nmeasure\". One such ideal measure, proposed by Granger himself, calculates the\nBhattacharyya distance between the joint probability density function (pdf) and\nthe product of marginal pdfs. We call this measure the mutual dependence.\nHowever, to date this measure has not been directly computable from data. In\nthis paper, we use our recently introduced maximum likelihood non-parametric\nestimator for band-limited pdfs, to compute the mutual dependence directly from\nthe data. We construct the estimator of mutual dependence and compare its\nperformance to standard measures (Pearson's and distance correlation) for\ndifferent known pdfs by computing convergence rates, computational complexity,\nand the ability to capture nonlinear dependencies. Our mutual dependence\nestimator requires fewer samples to converge to theoretical values, is faster\nto compute, and captures more complex dependencies than standard measures.\n","authors":"Rahul Agarwal|Pierre Sacre|Sridevi V. Sarma","affiliations":"","link_abstract":"http://arxiv.org/abs/1506.00673v1","link_pdf":"http://arxiv.org/pdf/1506.00673v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.ML|stat.TH"} {"id":"1506.04094v1","submitted":"2015-06-10 10:56:33","updated":"2015-06-10 10:56:33","title":"The WDAqua ITN: Answering Questions using Web Data","abstract":" WDAqua is a Marie Curie Innovative Training Network (ITN) and is funded under\nEU grant number 642795 and runs from January 2015 to December 2018. WDAqua aims\nat advancing the state of the art by intertwining training, research and\ninnovation efforts, centered around one service: data-driven question\nanswering. Question answering is immediately useful to a wide audience of end\nusers, and we will demonstrate this in settings including e-commerce, public\nsector information, publishing and smart cities. Question answering also covers\nweb science and data science broadly, leading to transferrable research results\nand to transferrable skills of the researchers who have finished our training\nprogramme. 
To ensure that our research improves question answering overall,\nevery individual research project connects at least two of these steps.\nIntersectional secondments (within a consortium covering academia, research\ninstitutes and industrial research) as well as network-wide workshops, R and D\nchallenges and innovation projects further balance ground-breaking research and\nthe needs of society and industry. Training-wise, these offerings equip early-stage\nresearchers with the expertise and transferable technical and non-technical\nskills that will allow them to pursue a successful career as an academic,\ndecision maker, practitioner or entrepreneur.\n","authors":"Christoph Lange|Saeedeh Shekarpour|Soren Auer","affiliations":"","link_abstract":"http://arxiv.org/abs/1506.04094v1","link_pdf":"http://arxiv.org/pdf/1506.04094v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.IR","categories":"cs.IR"} {"id":"1506.04815v1","submitted":"2015-06-16 01:32:51","updated":"2015-06-16 01:32:51","title":"Towards a unified query language for provenance and versioning","abstract":" Organizations and teams collect and acquire data from various sources, such\nas social interactions, financial transactions, sensor data, and genome\nsequencers. Different teams in an organization as well as different data\nscientists within a team are interested in extracting a variety of insights\nwhich require combining and collaboratively analyzing datasets in diverse ways.\nDataHub is a system that aims to provide robust version control and provenance\nmanagement for such a scenario. To be truly useful for collaborative data\nscience, one also needs the ability to specify queries and analysis tasks over\nthe versioning and the provenance information in a unified manner. In this\npaper, we present an initial design of our query language, called VQuel, that\naims to support such unified querying over both types of information, as well\nas the intermediate and final results of analyses. We also discuss some of the\nkey language design and implementation challenges moving forward.\n","authors":"Amit Chavan|Silu Huang|Amol Deshpande|Aaron Elmore|Samuel Madden|Aditya Parameswaran","affiliations":"","link_abstract":"http://arxiv.org/abs/1506.04815v1","link_pdf":"http://arxiv.org/pdf/1506.04815v1","link_doi":"","comment":"Theory and Practice of Provenance, 2015","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1506.05216v1","submitted":"2015-06-17 06:25:09","updated":"2015-06-17 06:25:09","title":"The k-NN algorithm for compositional data: a revised approach with and\n without zero values present","abstract":" In compositional data, an observation is a vector with non-negative\ncomponents which sum to a constant, typically 1. Data of this type arise in\nmany areas, such as geology, archaeology, biology, economics and political\nscience among others. The goal of this paper is to extend the taxicab metric\nand a newly suggested metric for compositional data by employing a power\ntransformation. Both metrics are to be used in the k-nearest neighbours\nalgorithm regardless of the presence of zeros. 
Examples with real data are\nexhibited.\n","authors":"Michail Tsagris","affiliations":"","link_abstract":"http://arxiv.org/abs/1506.05216v1","link_pdf":"http://arxiv.org/pdf/1506.05216v1","link_doi":"","comment":"This manuscript will appear at\n http://www.jds-online.com/volume-12-number-3-july-2014","journal_ref":"Journal of Data Science, Vol 12, Number 3, July 2014","doi":"","primary_category":"stat.ME","categories":"stat.ME|62H30"} {"id":"1506.08903v7","submitted":"2015-06-30 00:00:31","updated":"2017-09-12 14:10:44","title":"A roadmap for the computation of persistent homology","abstract":" Persistent homology (PH) is a method used in topological data analysis (TDA)\nto study qualitative features of data that persist across multiple scales. It\nis robust to perturbations of input data, independent of dimensions and\ncoordinates, and provides a compact representation of the qualitative features\nof the input. The computation of PH is an open area with numerous important and\nfascinating challenges. The field of PH computation is evolving rapidly, and\nnew algorithms and software implementations are being updated and released at a\nrapid pace. The purposes of our article are to (1) introduce theory and\ncomputational methods for PH to a broad range of computational scientists and\n(2) provide benchmarks of state-of-the-art implementations for the computation\nof PH. We give a friendly introduction to PH, navigate the pipeline for the\ncomputation of PH with an eye towards applications, and use a range of\nsynthetic and real-world data sets to evaluate currently available open-source\nimplementations for the computation of PH. Based on our benchmarking, we\nindicate which algorithms and implementations are best suited to different\ntypes of data sets. In an accompanying tutorial, we provide guidelines for the\ncomputation of PH. We make publicly available all scripts that we wrote for the\ntutorial, and we make available the processed version of the data sets used in\nthe benchmarking.\n","authors":"Nina Otter|Mason A. Porter|Ulrike Tillmann|Peter Grindrod|Heather A. Harrington","affiliations":"","link_abstract":"http://arxiv.org/abs/1506.08903v7","link_pdf":"http://arxiv.org/pdf/1506.08903v7","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-017-0109-5","comment":"Final version; minor changes throughout, added a section to the\n tutorial","journal_ref":"EPJ Data Science 2017 6:17, Springer Nature","doi":"10.1140/epjds/s13688-017-0109-5","primary_category":"math.AT","categories":"math.AT|cs.CG|physics.data-an|q-bio.QM"} {"id":"1507.00333v3","submitted":"2015-06-30 20:47:34","updated":"2016-05-06 10:35:36","title":"Notes on Low-rank Matrix Factorization","abstract":" Low-rank matrix factorization (MF) is an important technique in data science.\nThe key idea of MF is that there exist latent structures in the data, by\nuncovering which we could obtain a compressed representation of the data. By\nfactorizing an original matrix into low-rank matrices, MF provides a unified\nmethod for dimension reduction, clustering, and matrix completion. In this\narticle we review several important variants of MF, including: Basic MF,\nNon-negative MF, Orthogonal non-negative MF. As can be told from their names,\nnon-negative MF and orthogonal non-negative MF are variants of basic MF with\nnon-negativity and/or orthogonality constraints. Such constraints are useful in\nspecific scenarios. 
In the first part of this article, we introduce, for each of\nthese models, the application scenarios, the distinctive properties, and the\noptimizing method. By properly adapting MF, we can go beyond the problem of\nclustering and matrix completion. In the second part of this article, we will\nextend MF to sparse matrix completion, enhance matrix completion using\nvarious regularization methods, and make use of MF for (semi-)supervised\nlearning by introducing latent space reinforcement and transformation. We will\nsee that MF is not only a useful model but also a flexible framework that is\napplicable to various prediction problems.\n","authors":"Yuan Lu|Jie Yang","affiliations":"","link_abstract":"http://arxiv.org/abs/1507.00333v3","link_pdf":"http://arxiv.org/pdf/1507.00333v3","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.NA","categories":"cs.NA|cs.IR|cs.LG"} {"id":"1507.06515v1","submitted":"2015-07-23 14:34:55","updated":"2015-07-23 14:34:55","title":"Complex networks and public funding: the case of the 2007-2013 Italian\n program","abstract":" In this paper we apply techniques of complex network analysis to data sources\nrepresenting public funding programs and discuss the importance of the\nconsidered indicators for program evaluation. Starting from the Open Data\nrepository of the 2007-2013 Italian Program Programma Operativo Nazionale\n'Ricerca e Competitivit\\`a' (PON R&C), we build a set of data models and\nperform network analysis over them. We discuss the obtained experimental\nresults outlining interesting new perspectives that emerge from the application\nof the proposed methods to the socio-economical evaluation of funded programs.\n","authors":"Stefano Nicotri|Eufemia Tinelli|Nicola Amoroso|Elena Garuccio|Roberto Bellotti","affiliations":"","link_abstract":"http://arxiv.org/abs/1507.06515v1","link_pdf":"http://arxiv.org/pdf/1507.06515v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-015-0047-z","comment":"22 pages, 9 figures","journal_ref":"EPJ Data Science 2015, 4:8","doi":"10.1140/epjds/s13688-015-0047-z","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.SI|physics.data-an"} {"id":"1508.00019v1","submitted":"2015-07-31 20:21:38","updated":"2015-07-31 20:21:38","title":"A Minimal Architecture for General Cognition","abstract":" A minimalistic cognitive architecture called MANIC is presented. The MANIC\narchitecture requires only three function approximating models, and one state\nmachine. Even with so few major components, it is theoretically sufficient to\nachieve functional equivalence with all other cognitive architectures, and can\nbe practically trained. Instead of seeking to transfer architectural\ninspiration from biology into artificial intelligence, MANIC seeks to minimize\nnovelty and follow the most well-established constructs that have evolved\nwithin various sub-fields of data science. From this perspective, MANIC offers\nan alternate approach to a long-standing objective of artificial intelligence.\nThis paper provides a theoretical analysis of the MANIC architecture.\n","authors":"Michael S. Gashler|Zachariah Kindle|Michael R. 
Smith","affiliations":"","link_abstract":"http://arxiv.org/abs/1508.00019v1","link_pdf":"http://arxiv.org/pdf/1508.00019v1","link_doi":"","comment":"8 pages, 8 figures, conference, Proceedings of the 2015 International\n Joint Conference on Neural Networks","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI"} {"id":"1508.01278v3","submitted":"2015-08-06 04:22:21","updated":"2015-08-16 17:19:39","title":"Teaching Statistics at Google Scale","abstract":" Modern data and applications pose very different challenges from those of the\n1950s or even the 1980s. Students contemplating a career in statistics or data\nscience need to have the tools to tackle problems involving massive,\nheavy-tailed data, often interacting with live, complex systems. However,\ndespite the deepening connections between engineering and modern data science,\nwe argue that training in classical statistical concepts plays a central role\nin preparing students to solve Google-scale problems. To this end, we present\nthree industrial applications where significant modern data challenges were\novercome by statistical thinking.\n","authors":"Nicholas Chamandy|Omkar Muralidharan|Stefan Wager","affiliations":"","link_abstract":"http://arxiv.org/abs/1508.01278v3","link_pdf":"http://arxiv.org/pdf/1508.01278v3","link_doi":"","comment":"To appear in The American Statistician","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP"} {"id":"1508.01412v1","submitted":"2015-08-06 14:16:31","updated":"2015-08-06 14:16:31","title":"Enhanced Usability of Managing Workflows in an Industrial Data Gateway","abstract":" The Grid and Cloud User Support Environment (gUSE) enables users convenient\nand easy access to grid and cloud infrastructures by providing a general\npurpose, workflow-oriented graphical user interface to create and run workflows\non various Distributed Computing Infrastructures (DCIs). Its arrangements for\ncreating and modifying existing workflows are, however, non-intuitive and\ncumbersome due to the technologies and architecture employed by gUSE. In this\npaper, we outline the first integrated web-based workflow editor for gUSE with\nthe aim of improving the user experience for those with industrial data\nworkflows and the wider gUSE community. We report initial assessments of the\neditor's utility based on users' feedback. We argue that combining access to\ndiverse scalable resources with improved workflow creation tools is important\nfor all big data applications and research infrastructures.\n","authors":"Gary A. McGilvary|Malcolm Atkinson|Sandra Gesing|Alvaro Aguilera|Richard Grunzke|Eva Sciacca","affiliations":"","link_abstract":"http://arxiv.org/abs/1508.01412v1","link_pdf":"http://arxiv.org/pdf/1508.01412v1","link_doi":"","comment":"Proceedings of the 1st International Workshop on Interoperable\n Infrastructures for Interdisciplinary Big Data Sciences, 2015","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC|cs.SE"} {"id":"1508.02387v1","submitted":"2015-08-10 19:45:48","updated":"2015-08-10 19:45:48","title":"Indonesia embraces the Data Science","abstract":" The information era is the time when information is not only largely\ngenerated, but also vastly processed in order to extract and generated more\ninformation. The complex nature of modern living is represented by the various\nkind of data. Data can be in the forms of signals, images, texts, or manifolds\nresembling the horizon of observation. 
The task of the emerging data sciences\nis to extract information from the data, so that people gain new insights into the\ncomplex world. The insights may come from new ways of representing the data,\nbe it visualizations, mappings, or others. The insights may\nalso come from mathematical analysis and/or computational\nprocessing that reveals the states of nature represented by\nthe data. Both approaches implement methodologies that reduce the dimensionality of\nthe data. The relations between the two functions, representation and analysis,\nare at the heart of how information in data is transformed mathematically and\ncomputationally into new information. The paper discusses some practices, along\nwith various data drawn from social life in Indonesia, to gain new insights\nabout Indonesia through the emerging data sciences. Data science in Indonesia\nhas made Indonesian Data Cartograms, Indonesian Celebrity Sentiment Mapping,\nEthno-Clustering Maps, social media community detection, and a lot more to\ncome, possible. All of these are presented as examples of how\nData Science has become an integral part of the technology bringing data closer to\npeople.\n","authors":"Hokky Situngkir","affiliations":"","link_abstract":"http://arxiv.org/abs/1508.02387v1","link_pdf":"http://arxiv.org/pdf/1508.02387v1","link_doi":"","comment":"Paper presented in South East Asian Mathematical Society (SEAMS) 7th\n Conference, 10 pages, 7 figures","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1508.02428v1","submitted":"2015-08-10 21:15:43","updated":"2015-08-10 21:15:43","title":"FactorBase: SQL for Learning A Multi-Relational Graphical Model","abstract":" We describe FactorBase, a new SQL-based framework that leverages a relational\ndatabase management system to support multi-relational model discovery. A\nmulti-relational statistical model provides an integrated analysis of the\nheterogeneous and interdependent data resources in the database. We adopt the\nBayesStore design philosophy: statistical models are stored and managed as\nfirst-class citizens inside a database. Whereas previous systems like\nBayesStore support multi-relational inference, FactorBase supports\nmulti-relational learning. A case study on six benchmark databases evaluates\nhow our system supports a challenging machine learning application, namely\nlearning a first-order Bayesian network model for an entire database. Model\nlearning in this setting has to examine a large number of potential statistical\nassociations across data tables. 
Our implementation shows how the SQL\nconstructs in FactorBase facilitate the fast, modular, and reliable development\nof highly scalable model learning systems.\n","authors":"Oliver Schulte|Zhensong Qian","affiliations":"","link_abstract":"http://arxiv.org/abs/1508.02428v1","link_pdf":"http://arxiv.org/pdf/1508.02428v1","link_doi":"","comment":"14 pages, 10 figures, 10 tables, Published on 2015 IEEE International\n Conference on Data Science and Advanced Analytics (IEEE DSAA'2015), Oct\n 19-21, 2015, Paris, France","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB|cs.LG|H.2.8; H.2.4"} {"id":"1508.06228v3","submitted":"2015-08-23 11:30:11","updated":"2015-09-08 15:43:44","title":"Statistical look at reasons of involvement in wars","abstract":" The Correlates of War project scrupulously collects information about\ndisputes between the countries over a long historical period together with\nother data relevant to the character and reasons of international conflicts.\nUsing methods of modern Data Science implemented in the R statistical software,\nwe investigate the datasets from the project. We study political, economic, and\nreligious factors with respect to the emergence of conflicts and wars between\nthe countries. The results obtained lead to certain conclusions about variances\nand causalities between the factors considered. Some unpredictable features are\npresented.\n","authors":"Igor Mackarov","affiliations":"","link_abstract":"http://arxiv.org/abs/1508.06228v3","link_pdf":"http://arxiv.org/pdf/1508.06228v3","link_doi":"","comment":"Keywords: Correlates of War, Variance, Two Factorial ANOVA,\n Normality, R","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP"} {"id":"1509.03045v2","submitted":"2015-09-10 07:45:52","updated":"2016-10-12 07:13:47","title":"Empirical Big Data Research: A Systematic Literature Mapping","abstract":" Background: Big Data is a relatively new field of research and technology,\nand literature reports a wide variety of concepts labeled with Big Data. The\nmaturity of a research field can be measured in the number of publications\ncontaining empirical results. In this paper we present the current status of\nempirical research in Big Data. Method: We employed a systematic mapping method\nwith which we mapped the collected research according to the labels Variety,\nVolume and Velocity. In addition, we addressed the application areas of Big\nData. Results: We found that 151 of the assessed 1778 contributions contain a\nform of empirical result and can be mapped to one or more of the 3 V's and 59\naddress an application area. Conclusions: The share of publications containing\nempirical results is well below the average compared to computer science\nresearch as a whole. In order to mature the research on Big Data, we recommend\napplying empirical methods to strengthen the confidence in the reported\nresults. 
Based on our trend analysis we consider Volume and Variety to be the\nmost promising uncharted area in Big Data.\n","authors":"Bjørn Magnus Mathisen|Leendert Wienhofen|Dumitru Roman","affiliations":"","link_abstract":"http://arxiv.org/abs/1509.03045v2","link_pdf":"http://arxiv.org/pdf/1509.03045v2","link_doi":"","comment":"Submitted to Springer journal Data Science and Engineering","journal_ref":"","doi":"","primary_category":"cs.DL","categories":"cs.DL|cs.CY|cs.DB"} {"id":"1509.06695v1","submitted":"2015-09-22 17:25:47","updated":"2015-09-22 17:25:47","title":"Virtualizing Lifemapper Software Infrastructure for Biodiversity\n Expedition","abstract":" One of the activities of the Pacific Rim Applications and Grid Middleware\nAssembly (PRAGMA) is fostering Virtual Biodiversity Expeditions (VBEs) by\nbringing domain scientists and cyber infrastructure specialists together as a\nteam. Over the past few years PRAGMA members have been collaborating on\nvirtualizing the Lifemapper software. Virtualization and cloud computing have\nintroduced great flexibility and efficiency into IT projects. Virtualization\nprovides application scalability, maximizes resources utilization, and creates\na more efficient, agile, and automated infrastructure. However, there are\ndownsides to the complexity inherent in these environments, including the need\nfor special techniques to deploy cluster hosts, dependence on virtual\nenvironments, and challenging application installation, management, and\nconfiguration. In this paper, we report on progress of the Lifemapper\nvirtualization framework focused on a reproducible and highly configurable\ninfrastructure capable of fast deployment. A key contribution of this work is\ndescribing the practical experience in taking a complex, clustered,\ndomain-specific, data analysis and simulation system and making it available to\noperate on a variety of system configurations. Uses of this portability range\nfrom whole cluster replication to teaching and experimentation on a single\nlaptop. System virtualization is used to practically define and make portable\nthe full application stack, including all of its complex set of supporting\nsoftware.\n","authors":"Nadya Williams|Aimee Stewart|Phil Papadopoulos","affiliations":"","link_abstract":"http://arxiv.org/abs/1509.06695v1","link_pdf":"http://arxiv.org/pdf/1509.06695v1","link_doi":"","comment":"5 pages, 5 figures, PRAGMA Workshop on International Clouds for Data\n Science (PRAGMA-ICDS 2015)","journal_ref":"","doi":"","primary_category":"cs.SE","categories":"cs.SE|D.2.11"} {"id":"1509.06991v1","submitted":"2015-09-23 14:04:37","updated":"2015-09-23 14:04:37","title":"Feasibility Evaluation of 6LoWPAN over Bluetooth Low Energy","abstract":" IPv6 over Low power Wireless Personal Area Network (6LoWPAN) is an emerging\ntechnology to enable ubiquitous IoT services. However, there are very few\nstudies of the performance evaluation on real hardware environments. This paper\ndemonstrates the feasibility of 6LoWPAN through conducting a preliminary\nperformance evaluation of a commodity hardware environment, including Bluetooth\nLow Energy (BLE) network, Raspberry Pi, and a laptop PC. Our experimental\nresults show that the power consumption of 6LoWPAN over BLE is one-tenth lower\nthan that of IP over WiFi; the performance significantly depends on the\ndistance between devices and the message size; and the communication completely\nstops when bursty traffic transfers. 
This observation provides our optimistic\nconclusions on the feasibility of 6LoWPAN although the maturity of\nimplementations is a remaining issue.\n","authors":"Varat Chawathaworncharoen|Vasaka Visoottiviseth|Ryousei Takano","affiliations":"","link_abstract":"http://arxiv.org/abs/1509.06991v1","link_pdf":"http://arxiv.org/pdf/1509.06991v1","link_doi":"","comment":"4 pages, PRAGMA Workshop on International Clouds for Data Science\n (PRAGMA-ICDS 2015)","journal_ref":"","doi":"","primary_category":"cs.NI","categories":"cs.NI"} {"id":"1509.07626v1","submitted":"2015-09-25 08:21:41","updated":"2015-09-25 08:21:41","title":"Interactive Museum Exhibits with Microcontrollers: A Use-Case Scenario","abstract":" The feasibility of using microcontrollers in real life applications is\nbecoming more widespread. These applications have grown from do-it-yourself\n(DIY) projects of computer enthusiasts or robotics projects to larger scale\nefforts and deployments. This project developed and deployed a prototype\napplication that allows the public to interact with features of a model and\nview videos from a first-person perspective on the train. Through testing the\nmicrocontrollers and their usage in a public setting, it was demonstrated that\ninteractive features could be implemented in model train exhibits, which are\nfeatured in traditional museum environments that lack technical infrastructure.\nSpecifically, the Arduino and Raspberry Pi provide the necessary linkages\nbetween the Internet and hardware, allowing for a greater interactive\nexperience for museum visitors. These results provide an important use-case\nscenario that cultural heritage institutions can utilize when implementing\nmicrocontrollers on a larger scale, for the purpose of increasing visitors\nexperience through greater interaction and engagement.\n","authors":"Lok Wong|Shinji Shimojo|Yuuichi Teranishi|Tomoki Yoshihisa|Jason H. Haga","affiliations":"","link_abstract":"http://arxiv.org/abs/1509.07626v1","link_pdf":"http://arxiv.org/pdf/1509.07626v1","link_doi":"","comment":"6 pages, 6 figures, PRAGMA Workshop on International Clouds for Data\n Science (PRAGMA-ICDS 2015)","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|C.3; J.5"} {"id":"1509.07804v3","submitted":"2015-09-25 17:40:16","updated":"2015-12-22 09:53:34","title":"From Statistician to Data Scientist","abstract":" According to a recent report from the European Commission, the world\ngenerates every minute 1.7 million of billions of data bytes, the equivalent of\n360,000 DVDs, and companies that build their decision-making processes by\nexploiting these data increase their productivity. The treatment and\nvalorization of massive data has consequences on the employment of graduate\nstudents in statistics. Which additional skills do students trained in\nstatistics need to acquire to become data scientists ? How to evolve training\nso that future graduates can adapt to rapid changes in this area, without\nneglecting traditional jobs and the fundamental and lasting foundation for the\ntraining? 
After considering the notion of big data and questioning the\nemergence of a \"new\" science: Data Science, we present the current developments\nin the training of engineers in Mathematical and Modeling at INSA Toulouse.\n","authors":"Philippe Besse|Beatrice Laurent","affiliations":"IMT, INSA Toulouse|IMT","link_abstract":"http://arxiv.org/abs/1509.07804v3","link_pdf":"http://arxiv.org/pdf/1509.07804v3","link_doi":"","comment":"in French","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT"} {"id":"1510.00292v2","submitted":"2015-10-01 15:57:16","updated":"2016-03-18 14:01:16","title":"On Classification Issues within Ensemble-Based Complex System Simulation\n Tasks","abstract":" Contemporary tasks of complex system simulation are often related to the\nissue of uncertainty management. It comes from the lack of information or\nknowledge about the simulated system as well as from restrictions of the model\nset being used. One of the powerful tools for the uncertainty management is\nensemble-based simulation, which uses variation in input or output data, model\nparameters, or available versions of models to improve the simulation\nperformance. Furthermore the system of models for complex system simulation\n(especially in case of hiring ensemble-based approach) can be considered as a\ncomplex system. As a result, the identification of the complex model's\nstructure and parameters provide additional sources of uncertainty to be\nmanaged. Within the presented work we are developing a conceptual and\ntechnological approach to manage the ensemble-based simulation taking into\naccount changing states of both simulated system and system of models within\nthe ensemble-based approach. The states of these systems are considered as a\nsubject of classification with consequent inference of better strategies for\nensemble evolution over the simulation time and ensemble aggregation. Here the\nensemble evolution enables implementation of dynamic reactive solutions which\ncan automatically conform to the changing states of both systems. The ensemble\naggregation can be considered within a scope of averaging (regression way) or\nselection (classification way, which complement the classification mentioned\nearlier) approach. The technological basis for such approach includes\nensemble-based simulation techniques using domain-specific software combined\nwithin a composite application; data science approaches for analysis of\navailable datasets (simulation data, observations, situation assessment etc.);\nand machine learning algorithms for classes identification, ensemble management\nand knowledge acquisition.\n","authors":"Sergey V. Kovalchuk|Aleksey V. Krikunov|Konstantin V. Knyazkov|Sergey S. Kosukhin|Alexander V. Boukhanovsky","affiliations":"","link_abstract":"http://arxiv.org/abs/1510.00292v2","link_pdf":"http://arxiv.org/pdf/1510.00292v2","link_doi":"http://dx.doi.org/10.1007/s00477-016-1324-5","comment":"To be presented at CCS'15 (http://www.ccs2015.org/)","journal_ref":"","doi":"10.1007/s00477-016-1324-5","primary_category":"stat.CO","categories":"stat.CO|stat.ME|stat.OT"} {"id":"1510.01932v2","submitted":"2015-10-07 13:07:04","updated":"2016-10-26 11:05:38","title":"An Experimental Study of Segregation Mechanisms","abstract":" Segregation is widespread in all realms of human society. Several influential\nstudies have argued that intolerance is not a prerequisite for a segregated\nsociety, and that segregation can arise even when people generally prefer\ndiversity. 
We investigated this paradox experimentally, by letting groups of\nhigh-school students play four different real-time interactive games.\nIncentives for neighbor similarity produced segregation, but incentives for\nneighbor dissimilarity and neighborhood diversity prevented it. The\nparticipants continued to move while their game scores were below optimal, but\ntheir individual moves did not consistently take them to the best alternative\nposition. These small differences between human and simulated agents produced\ndifferent segregation patterns than previously predicted, thus challenging\nconclusions about segregation arising from these models.\n","authors":"Milena Tsvetkova|Olof Nilsson|Camilla Öhman|Lovisa Sumpter|David Sumpter","affiliations":"","link_abstract":"http://arxiv.org/abs/1510.01932v2","link_pdf":"http://arxiv.org/pdf/1510.01932v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-016-0065-5","comment":"Published in EPJ Data Science","journal_ref":"EPJ Data Science (2016) 5:4","doi":"10.1140/epjds/s13688-016-0065-5","primary_category":"cs.SI","categories":"cs.SI|cs.CY"} {"id":"1510.05981v1","submitted":"2015-10-20 17:44:44","updated":"2015-10-20 17:44:44","title":"A latent shared-component generative model for real-time disease\n surveillance using Twitter data","abstract":" Exploiting the large amount of available data for addressing relevant social\nproblems has been one of the key challenges in data mining. Such efforts have\nbeen recently named \"data science for social good\" and attracted the attention\nof several researchers and institutions. We give a contribution in this\nobjective in this paper considering a difficult public health problem, the\ntimely monitoring of dengue epidemics in small geographical areas. We develop a\ngenerative simple yet effective model to connect the fluctuations of disease\ncases and disease-related Twitter posts. We considered a hidden Markov process\ndriving both, the fluctuations in dengue reported cases and the tweets issued\nin each region. We add a stable but random source of tweets to represent the\nposts when no disease cases are recorded. The model is learned through a Markov\nchain Monte Carlo algorithm that produces the posterior distribution of the\nrelevant parameters. Using data from a significant number of large Brazilian\ntowns, we demonstrate empirically that our model is able to predict well the\nnext weeks of the disease counts using the tweets and disease cases jointly.\n","authors":"Roberto C. S. N. P. Souza|Denise E. F de Brito|Renato M. Assunção|Wagner Meira Jr","affiliations":"","link_abstract":"http://arxiv.org/abs/1510.05981v1","link_pdf":"http://arxiv.org/pdf/1510.05981v1","link_doi":"","comment":"Appears in 2nd ACM SIGKDD Workshop on Connected Health at Big Data\n Era (BigCHat)","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI|stat.ML"} {"id":"1510.07172v1","submitted":"2015-10-24 18:26:26","updated":"2015-10-24 18:26:26","title":"Object oriented models vs. 
data analysis -- is this the right\n alternative?","abstract":" We review the role of mathematics from a historical and a conceptual\nperspective in the light of modern data science.\n","authors":"Jürgen Jost","affiliations":"","link_abstract":"http://arxiv.org/abs/1510.07172v1","link_pdf":"http://arxiv.org/pdf/1510.07172v1","link_doi":"","comment":"This paper was developed within the ZIF project \"Mathematics as a\n Tool\", and it will appear in a special volume recording the results achieved\n in that project","journal_ref":"","doi":"","primary_category":"math.HO","categories":"math.HO"} {"id":"1511.04134v1","submitted":"2015-11-13 01:13:48","updated":"2015-11-13 01:13:48","title":"Whom Should We Sense in \"Social Sensing\" -- Analyzing Which Users Work\n Best for Social Media Now-Casting","abstract":" Given the ever increasing amount of publicly available social media data,\nthere is growing interest in using online data to study and quantify phenomena\nin the offline \"real\" world. As social media data can be obtained in near\nreal-time and at low cost, it is often used for \"now-casting\" indices such as\nlevels of flu activity or unemployment. The term \"social sensing\" is often used\nin this context to describe the idea that users act as \"sensors\", publicly\nreporting their health status or job losses. Sensor activity during a time\nperiod is then typically aggregated in a \"one tweet, one vote\" fashion by\nsimply counting. At the same time, researchers readily admit that social media\nusers are not a perfect representation of the actual population. Additionally,\nusers differ in the amount of details of their personal lives that they reveal.\nIntuitively, it should be possible to improve now-casting by assigning\ndifferent weights to different user groups.\n In this paper, we ask \"How does social sensing actually work?\" or, more\nprecisely, \"Whom should we sense--and whom not--for optimal results?\". We\ninvestigate how different sampling strategies affect the performance of\nnow-casting of two common offline indices: flu activity and unemployment rate.\nWe show that now-casting can be improved by 1) applying user filtering\ntechniques and 2) selecting users with complete profiles. We also find that,\nusing the right type of user groups, now-casting performance does not degrade,\neven when drastically reducing the size of the dataset. More fundamentally, we\ndescribe which type of users contribute most to the accuracy by asking if\n\"babblers are better\". We conclude the paper by providing guidance on how to\nselect better user groups for more accurate now-casting.\n","authors":"Jisun An|Ingmar Weber","affiliations":"","link_abstract":"http://arxiv.org/abs/1511.04134v1","link_pdf":"http://arxiv.org/pdf/1511.04134v1","link_doi":"","comment":"This is a pre-print of a forthcoming EPJ Data Science paper","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI|cs.CY"} {"id":"1511.05082v1","submitted":"2015-11-16 18:42:04","updated":"2015-11-16 18:42:04","title":"Topic Modeling of Behavioral Modes Using Sensor Data","abstract":" The field of Movement Ecology, like so many other fields, is experiencing a\nperiod of rapid growth in availability of data. As the volume rises,\ntraditional methods are giving way to machine learning and data science, which\nare playing an increasingly large part it turning this data into\nscience-driving insights. 
One rich and interesting source is the bio-logger.\nThese small electronic wearable devices are attached to animals free to roam in\ntheir natural habitats, and report back readings from multiple sensors,\nincluding GPS and accelerometer bursts. A common use of accelerometer data is\nfor supervised learning of behavioral modes. However, we need unsupervised\nanalysis tools as well, in order to overcome the inherent difficulties of\nobtaining a labeled dataset, which in some cases is either infeasible or does\nnot successfully encompass the full repertoire of behavioral modes of interest.\nHere we present a matrix factorization based topic-model method for\naccelerometer bursts, derived using a linear mixture property of patch\nfeatures. Our method is validated via comparison to a labeled dataset, and is\nfurther compared to standard clustering algorithms.\n","authors":"Yehezkel S. Resheff|Shay Rotics|Ran Nathan|Daphna Weinshall","affiliations":"","link_abstract":"http://arxiv.org/abs/1511.05082v1","link_pdf":"http://arxiv.org/pdf/1511.05082v1","link_doi":"","comment":"Invited Extended version of a paper \\cite{resheffmatrix} presented at\n the international conference \\textit{Data Science and Advanced Analytics},\n Paris, France, 19-21 OCtober 2015","journal_ref":"International Journal of Data Science and Analytics 1.1 (2016):\n 51-60","doi":"","primary_category":"cs.LG","categories":"cs.LG"} {"id":"1511.07643v1","submitted":"2015-11-24 10:57:07","updated":"2015-11-24 10:57:07","title":"Homophily and missing links in citation networks","abstract":" Citation networks have been widely used to study the evolution of science\nthrough the lenses of the underlying patterns of knowledge flows among academic\npapers, authors, research sub-fields, and scientific journals. Here we focus on\ncitation networks to cast light on the salience of homophily, namely the\nprinciple that similarity breeds connection, for knowledge transfer between\npapers. To this end, we assess the degree to which citations tend to occur\nbetween papers that are concerned with seemingly related topics or research\nproblems. Drawing on a large data set of articles published in the journals of\nthe American Physical Society between 1893 and 2009, we propose a novel method\nfor measuring the similarity between articles through the statistical\nvalidation of the overlap between their bibliographies. Results suggest that\nthe probability of a citation made by one article to another is indeed an\nincreasing function of the similarity between the two articles. Our study also\nenables us to uncover missing citations between pairs of highly related\narticles, and may thus help identify barriers to effective knowledge flows. By\nquantifying the proportion of missing citations, we conduct a comparative\nassessment of distinct journals and research sub-fields in terms of their\nability to facilitate or impede the dissemination of knowledge. Findings\nindicate that knowledge transfer seems to be more effectively facilitated by\njournals of wide visibility, such as Physical Review Letters, than by\nlower-impact ones. 
Our study has important implications for authors, editors\nand reviewers of scientific journals, as well as public preprint repositories,\nas it provides a procedure for recommending relevant yet missing references and\nproperly integrating bibliographies of papers.\n","authors":"Valerio Ciotti|Moreno Bonaventura|Vincenzo Nicosia|Pietro Panzarasa|Vito Latora","affiliations":"","link_abstract":"http://arxiv.org/abs/1511.07643v1","link_pdf":"http://arxiv.org/pdf/1511.07643v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-016-0068-2","comment":"11 pages, 4 figures, 1 table","journal_ref":"EPJ Data Science 5:7 doi:10.1140/epjds/s13688-016-0068-2 (2016)","doi":"10.1140/epjds/s13688-016-0068-2","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.DL|cs.IR|cs.SI"} {"id":"1511.07916v1","submitted":"2015-11-24 23:23:13","updated":"2015-11-24 23:23:13","title":"Natural Language Understanding with Distributed Representation","abstract":" This is a lecture note for the course DS-GA 3001 at the Center for Data Science ,\nNew York University in Fall, 2015. As the name of the course suggests, this\nlecture note introduces readers to a neural network based approach to natural\nlanguage understanding/processing. In order to make it as self-contained as\npossible, I spend much time on describing basics of machine learning and neural\nnetworks, only after which how they are used for natural languages is\nintroduced. On the language front, I almost solely focus on language modelling\nand machine translation, two of which I personally find most fascinating and\nmost fundamental to natural language understanding.\n","authors":"Kyunghyun Cho","affiliations":"","link_abstract":"http://arxiv.org/abs/1511.07916v1","link_pdf":"http://arxiv.org/pdf/1511.07916v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CL","categories":"cs.CL|stat.ML"} {"id":"1512.04456v1","submitted":"2015-12-14 18:45:45","updated":"2015-12-14 18:45:45","title":"Teaching the Foundations of Data Science: An Interdisciplinary Approach","abstract":" The astronomical growth of data has necessitated the need for educating\nwell-qualified data scientists to derive deep insights from large and complex\ndata sets generated by organizations. In this paper, we present our\ninterdisciplinary approach and experiences in teaching a Data Science course,\nthe first of its kind offered at the Wright State University. Two faculty\nmembers from the Management Information Systems (MIS) and Computer Science (CS)\ndepartments designed and co-taught the course with perspectives from their\nprevious research and teaching experiences. Students in the class had mix\nbackgrounds with mainly MIS and CS majors. Students' learning outcomes and post\ncourse survey responses suggested that the course delivered a broad overview of\ndata science as desired, and that students worked synergistically with those of\ndifferent majors in collaborative lab assignments and in a semester long\nproject. 
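[Editorial illustration for the citation-homophily record above (1511.07643): a minimal sketch of similarity based on the overlap of two papers' bibliographies. The reference identifiers are invented, and plain Jaccard overlap stands in for the paper's statistical validation of the overlap.]

```python
# Hedged sketch: similarity of two articles via the overlap of their
# bibliographies (plain Jaccard; the paper statistically validates overlaps).
def bibliographic_similarity(refs_a, refs_b):
    a, b = set(refs_a), set(refs_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented reference identifiers, for illustration only.
paper_a = ["PhysRev.47.777", "PhysRevLett.49.91", "PhysRevA.40.4277"]
paper_b = ["PhysRevLett.49.91", "PhysRevA.40.4277", "PhysRevLett.67.661"]
print(bibliographic_similarity(paper_a, paper_b))  # 0.5
```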
The interdisciplinary pedagogy helped build collaboration and create\nsatisfaction among learners.\n","authors":"Daniel Asamoah|Derek Doran|Shu Schiller","affiliations":"","link_abstract":"http://arxiv.org/abs/1512.04456v1","link_pdf":"http://arxiv.org/pdf/1512.04456v1","link_doi":"","comment":"Presented at SIGDSA Business Analytics Conference 2015","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1512.04776v1","submitted":"2015-12-15 13:32:47","updated":"2015-12-15 13:32:47","title":"Predicting links in ego-networks using temporal information","abstract":" Link prediction appears as a central problem of network science, as it calls\nfor unfolding the mechanisms that govern the micro-dynamics of the network. In\nthis work, we are interested in ego-networks, that is, the information about the\ninteractions of a node with its neighbors, in the context of social\nrelationships. As the structural information is very poor, we rely on another\nsource of information to predict links among egos' neighbors: the timing of\ninteractions. We define several features to capture different kinds of temporal\ninformation and apply machine learning methods to combine these various\nfeatures and improve the quality of the prediction. We demonstrate the\nefficiency of this temporal approach on a cellphone interaction dataset,\npointing out features which prove to perform well in this context,\nin particular the temporal profile of interactions and elapsed time between\ncontacts.\n","authors":"Lionel Tabourier|Anne-Sophie Libert|Renaud Lambiotte","affiliations":"","link_abstract":"http://arxiv.org/abs/1512.04776v1","link_pdf":"http://arxiv.org/pdf/1512.04776v1","link_doi":"","comment":"submitted to EPJ Data Science","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI|physics.soc-ph"} {"id":"1512.05979v1","submitted":"2015-12-18 14:59:26","updated":"2015-12-18 14:59:26","title":"Energy Consumption Forecasting for Smart Meters","abstract":" Earth, water, air, food, shelter and energy are essential factors required\nfor human beings to survive on the planet. Among these, energy plays a key role in\nour day-to-day living, including lighting, cooling and heating of\nshelter, and preparation of food. Because of this dependency, the production and distribution of energy, specifically\nelectricity, became a high-tech industry. Unlike\nother industries, the key differentiator of the electricity industry is the product\nitself: it can be produced but cannot be stored for the future; production and\nconsumption happen in near real-time. This peculiarity of the\nindustry is the key driver for Machine Learning and Data Science based\ninnovations in this industry. There is always a gap between demand and\nsupply in the electricity market across the globe. To fill this gap and improve\nservice efficiency by providing the necessary supply to the market,\ncommercial as well as federal electricity companies employ forecasting\ntechniques to predict future demand, try to meet it, and provide\ncurtailment guidelines to optimise electricity consumption/demand. In this\npaper, the authors examine the application of Machine Learning algorithms,\nspecifically Boosted Decision Tree Regression, to the modelling and forecasting\nof energy consumption for smart meters. The data used for this exercise are\nobtained from the DECC data website. In addition, the methodology has been\ntested on Smart Meter data obtained from EMA Singapore. 
This paper focuses on\nfeature engineering for time series forecasting using regression algorithms and\nderiving a methodology to create personalised electricity plans offers for\nhousehold users based on usage history.\n","authors":"Anshul Bansal|Susheel Kaushik Rompikuntla|Jaganadh Gopinadhan|Amanpreet Kaur|Zahoor Ahamed Kazi","affiliations":"","link_abstract":"http://arxiv.org/abs/1512.05979v1","link_pdf":"http://arxiv.org/pdf/1512.05979v1","link_doi":"","comment":"Presented at BAI Conference 2015 at IIM Bangalore, India","journal_ref":"","doi":"","primary_category":"cs.OH","categories":"cs.OH"} {"id":"1512.06169v2","submitted":"2015-12-19 00:28:43","updated":"2015-12-27 09:25:46","title":"Technological novelty profile and invention's future impact","abstract":" We consider inventions as novel combinations of existing technological\ncapabilities. Patent data allow us to explicitly identify such combinatorial\nprocesses in invention activities. Unconsidered in the previous research, not\nevery new combination is novel to the same extent. Some combinations are\nnaturally anticipated based on patent activities in the past or mere random\nchoices, and some appear to deviate exceptionally from existing invention\npathways. We calculate a relative likelihood that each pair of classification\ncodes is put together at random, and a deviation from the empirical observation\nso as to assess the overall novelty (or conventionality) that the patent brings\nforth at each year. An invention is considered as unconventional if a pair of\ncodes therein is unlikely to be used together given the statistics in the past.\nTemporal evolution of the distribution indicates that the patenting activities\nbecome more conventional with occasional cross-over combinations. Our analyses\nshow that patents introducing novelty on top of the conventional units would\nreceive higher citations, and hence have higher impact.\n","authors":"Daniel Kim|Daniel Burkhardt Cerigo|Hawoong Jeong|Hyejin Youn","affiliations":"","link_abstract":"http://arxiv.org/abs/1512.06169v2","link_pdf":"http://arxiv.org/pdf/1512.06169v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-016-0069-1","comment":"20 pages, 7 figures","journal_ref":"EPJ Data Science 5:8 2016","doi":"10.1140/epjds/s13688-016-0069-1","primary_category":"physics.soc-ph","categories":"physics.soc-ph|physics.data-an"} {"id":"1601.02946v5","submitted":"2016-01-12 16:32:16","updated":"2020-06-14 17:28:51","title":"Product Formalisms for Measures on Spaces with Binary Tree Structures:\n Representation, Visualization, and Multiscale Noise","abstract":" In this paper we present a theoretical foundation for a representation of a\ndata set as a measure in a very large hierarchically parametrized family of\npositive measures, whose parameters can be computed explicitly (rather than\nestimated by optimization), and illustrate its applicability to a wide range of\ndata types. The pre-processing step then consists of representing data sets as\nsimple measures. The theoretical foundation consists of a dyadic product\nformula representation lemma, a visualization theorem. We also define an\nadditive multiscale noise model which can be used to sample from dyadic\nmeasures and a more general multiplicative multiscale noise model which can be\nused to perturb continuous functions, Borel measures, and dyadic measures. The\nfirst two results are based on theorems. 
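[Editorial illustration for the smart-meter record above (1512.05979), which names Boosted Decision Tree Regression on consumption data: a generic sketch of such a pipeline using scikit-learn's GradientBoostingRegressor on synthetic half-hourly readings with simple lag features. The data, lag choices and estimator settings are assumptions, not the authors' implementation.]

```python
# Hedged sketch: boosted-tree forecasting of smart-meter consumption
# from simple lag features (synthetic data, assumed lags).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
t = np.arange(2000)
load = 1.0 + 0.5 * np.sin(2 * np.pi * t / 48) + 0.1 * rng.standard_normal(len(t))

lags = [1, 2, 48]                      # previous readings and the same slot one day earlier
X = np.column_stack([np.roll(load, k) for k in lags])[max(lags):]
y = load[max(lags):]

split = len(y) - 200                   # hold out the last 200 half-hours
model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
model.fit(X[:split], y[:split])
print("held-out R^2:", round(model.score(X[split:], y[split:]), 3))
```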
The representation uses the very\nsimple concept of a dyadic tree, and hence is widely applicable, easily\nunderstood, and easily computed. Since the data sample is represented as a\nmeasure, subsequent analysis can exploit statistical and measure theoretic\nconcepts and theories. Because the representation uses the very simple concept\nof a dyadic tree defined on the universe of a data set and the parameters are\nsimply and explicitly computable and easily interpretable and visualizable, we\nhope that this approach will be broadly useful to mathematicians,\nstatisticians, and computer scientists who are intrigued by or involved in data\nscience including its mathematical foundations.\n","authors":"Devasis Bassu|Peter W. Jones|Linda Ness|David Shallcross","affiliations":"","link_abstract":"http://arxiv.org/abs/1601.02946v5","link_pdf":"http://arxiv.org/pdf/1601.02946v5","link_doi":"","comment":"submitted","journal_ref":"","doi":"","primary_category":"math.CA","categories":"math.CA|42B99"} {"id":"1601.04890v2","submitted":"2016-01-19 12:25:23","updated":"2016-03-03 03:05:37","title":"Women Through the Glass Ceiling: Gender Asymmetries in Wikipedia","abstract":" Contributing to the writing of history has never been as easy as it is today\nthanks to Wikipedia, a community-created encyclopedia that aims to document the\nworld's knowledge from a neutral point of view. Though everyone can participate\nit is well known that the editor community has a narrow diversity, with a\nmajority of white male editors. While this participatory \\emph{gender gap} has\nbeen studied extensively in the literature, this work sets out to \\emph{assess\npotential gender inequalities in Wikipedia articles} along different\ndimensions: notability, topical focus, linguistic bias, structural properties,\nand meta-data presentation.\n We find that (i) women in Wikipedia are more notable than men, which we\ninterpret as the outcome of a subtle glass ceiling effect; (ii) family-,\ngender-, and relationship-related topics are more present in biographies about\nwomen; (iii) linguistic bias manifests in Wikipedia since abstract terms tend\nto be used to describe positive aspects in the biographies of men and negative\naspects in the biographies of women; and (iv) there are structural differences\nin terms of meta-data and hyperlinks, which have consequences for\ninformation-seeking activities. While some differences are expected, due to\nhistorical and social contexts, other differences are attributable to Wikipedia\neditors. The implications of such differences are discussed having Wikipedia\ncontribution policies in mind. We hope that the present work will contribute to\nincreased awareness about, first, gender issues in the content of Wikipedia,\nand second, the different levels on which gender biases can manifest on the\nWeb.\n","authors":"Claudia Wagner|Eduardo Graells-Garrido|David Garcia|Filippo Menczer","affiliations":"","link_abstract":"http://arxiv.org/abs/1601.04890v2","link_pdf":"http://arxiv.org/pdf/1601.04890v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-016-0066-4","comment":"23 pages. 
Published in EPJ Data Science 2016 5:5","journal_ref":"","doi":"10.1140/epjds/s13688-016-0066-4","primary_category":"cs.SI","categories":"cs.SI"} {"id":"1601.06035v1","submitted":"2016-01-22 15:09:18","updated":"2016-01-22 15:09:18","title":"Recommender systems inspired by the structure of quantum theory","abstract":" Physicists use quantum models to describe the behavior of physical systems.\nQuantum models owe their success to their interpretability, to their relation\nto probabilistic models (quantization of classical models) and to their high\npredictive power. Beyond physics, these properties are valuable in general data\nscience. This motivates the use of quantum models to analyze general\nnonphysical datasets. Here we provide both empirical and theoretical insights\ninto the application of quantum models in data science. In the theoretical part\nof this paper, we firstly show that quantum models can be exponentially more\nefficient than probabilistic models because there exist datasets that admit\nlow-dimensional quantum models and only exponentially high-dimensional\nprobabilistic models. Secondly, we explain in what sense quantum models realize\na useful relaxation of compressed probabilistic models. Thirdly, we show that\nsparse datasets admit low-dimensional quantum models and finally, we introduce\na method to compute hierarchical orderings of properties of users (e.g.,\npersonality traits) and items (e.g., genres of movies). In the empirical part\nof the paper, we evaluate quantum models in item recommendation and observe\nthat the predictive power of quantum-inspired recommender systems can compete\nwith state-of-the-art recommender systems like SVD++ and PureSVD. Furthermore,\nwe make use of the interpretability of quantum models by computing hierarchical\norderings of properties of users and items. This work establishes a connection\nbetween data science (item recommendation), information theory (communication\ncomplexity), mathematical programming (positive semidefinite factorizations)\nand physics (quantum models).\n","authors":"Cyril Stark","affiliations":"","link_abstract":"http://arxiv.org/abs/1601.06035v1","link_pdf":"http://arxiv.org/pdf/1601.06035v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.IT|math.IT|math.OC|quant-ph|stat.ML"} {"id":"1601.06128v1","submitted":"2016-01-22 19:46:33","updated":"2016-01-22 19:46:33","title":"RioBusData: Outlier Detection in Bus Routes of Rio de Janeiro","abstract":" Buses are the primary means of public transportation in the city of Rio de\nJaneiro, carrying around 100 million passengers every month. Recently,\nreal-time GPS coordinates of all operating public buses has been made publicly\navailable - roughly 1 million GPS entries each captured each day. In an initial\nstudy, we observed that a substantial number of buses follow trajectories that\ndo not follow the expected behavior. In this paper, we present RioBusData, a\ntool that helps users identify and explore, through different visualizations,\nthe behavior of outlier trajectories. 
We describe how the system automatically\ndetects these outliers using a Convolutional Neural Network (CNN) and we also\ndiscuss a series of case studies which show how RioBusData helps users better\nunderstand not only the flow and service of outlier buses but also the bus\nsystem as a whole.\n","authors":"Aline Bessa|Fernando de Mesentier Silva|Rodrigo Frassetto Nogueira|Enrico Bertini|Juliana Freire","affiliations":"","link_abstract":"http://arxiv.org/abs/1601.06128v1","link_pdf":"http://arxiv.org/pdf/1601.06128v1","link_doi":"","comment":"In Symposium on Visualization in Data Science (VDS at IEEE VIS),\n Chicago, Illinois, US, 2015","journal_ref":"","doi":"","primary_category":"cs.HC","categories":"cs.HC"} {"id":"1601.07741v2","submitted":"2016-01-28 12:48:20","updated":"2016-03-25 09:01:54","title":"Touristic site attractiveness seen through Twitter","abstract":" Tourism is becoming a significant contributor to medium and long range\ntravels in an increasingly globalized world. Leisure traveling has an important\nimpact on the local and global economy as well as on the environment. The study\nof touristic trips is thus raising a considerable interest. In this work, we\napply a method to assess the attractiveness of 20 of the most popular touristic\nsites worldwide using geolocated tweets as a proxy for human mobility. We first\nrank the touristic sites based on the spatial distribution of the visitors'\nplace of residence. The Taj Mahal, the Pisa Tower and the Eiffel Tower appear\nconsistently in the top 5 in these rankings. We then pass to a coarser scale\nand classify the travelers by country of residence. Touristic site's visiting\nfigures are then studied by country of residence showing that the Eiffel Tower,\nTimes Square and the London Tower welcome the majority of the visitors of each\ncountry. Finally, we build a network linking sites whenever a user has been\ndetected in more than one site. This allow us to unveil relations between\ntouristic sites and find which ones are more tightly interconnected.\n","authors":"Aleix Bassolas|Maxime Lenormand|Antònia Tugores|Bruno Gonçalves|José J. Ramasco","affiliations":"","link_abstract":"http://arxiv.org/abs/1601.07741v2","link_pdf":"http://arxiv.org/pdf/1601.07741v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-016-0073-5","comment":"8 pages and 5 figures","journal_ref":"EPJ Data Science 5, 12 (2016)","doi":"10.1140/epjds/s13688-016-0073-5","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.SI"} {"id":"1601.07925v1","submitted":"2016-01-28 21:45:55","updated":"2016-01-28 21:45:55","title":"Automating biomedical data science through tree-based pipeline\n optimization","abstract":" Over the past decade, data science and machine learning has grown from a\nmysterious art form to a staple tool across a variety of fields in academia,\nbusiness, and government. In this paper, we introduce the concept of tree-based\npipeline optimization for automating one of the most tedious parts of machine\nlearning---pipeline design. We implement a Tree-based Pipeline Optimization\nTool (TPOT) and demonstrate its effectiveness on a series of simulated and\nreal-world genetic data sets. In particular, we show that TPOT can build\nmachine learning pipelines that achieve competitive classification accuracy and\ndiscover novel pipeline operators---such as synthetic feature\nconstructors---that significantly improve classification accuracy on these data\nsets. 
We also highlight the current challenges to pipeline optimization, such\nas the tendency to produce pipelines that overfit the data, and suggest future\nresearch paths to overcome these challenges. As such, this work represents an\nearly step toward fully automating machine learning pipeline design.\n","authors":"Randal S. Olson|Ryan J. Urbanowicz|Peter C. Andrews|Nicole A. Lavender|La Creis Kidd|Jason H. Moore","affiliations":"","link_abstract":"http://arxiv.org/abs/1601.07925v1","link_pdf":"http://arxiv.org/pdf/1601.07925v1","link_doi":"","comment":"16 pages, 5 figures, to appear in EvoBIO 2016 proceedings","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.NE"} {"id":"1602.08451v1","submitted":"2016-02-06 13:49:16","updated":"2016-02-06 13:49:16","title":"Ground truth? Concept-based communities versus the external\n classification of physics manuscripts","abstract":" Community detection techniques are widely used to infer hidden structures\nwithin interconnected systems. Despite demonstrating high accuracy on\nbenchmarks, they reproduce the external classification for many real-world\nsystems with a significant level of discrepancy. A widely accepted reason\nbehind such outcome is the unavoidable loss of non-topological information\n(such as node attributes) encountered when the original complex system is\nrepresented as a network. In this article we emphasize that the observed\ndiscrepancies may also be caused by a different reason: the external\nclassification itself. For this end we use scientific publication data which i)\nexhibit a well defined modular structure and ii) hold an expert-made\nclassification of research articles. Having represented the articles and the\nextracted scientific concepts both as a bipartite network and as its unipartite\nprojection, we applied modularity optimization to uncover the inner thematic\nstructure. The resulting clusters are shown to partly reflect the author-made\nclassification, although some significant discrepancies are observed. A\ndetailed analysis of these discrepancies shows that they carry essential\ninformation about the system, mainly related to the use of similar techniques\nand methods across different (sub)disciplines, that is otherwise omitted when\nonly the external classification is considered.\n","authors":"Vasyl Palchykov|Valerio Gemmetto|Alexey Boyarsky|Diego Garlaschelli","affiliations":"","link_abstract":"http://arxiv.org/abs/1602.08451v1","link_pdf":"http://arxiv.org/pdf/1602.08451v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-016-0090-4","comment":"15 pages, 2 figures","journal_ref":"EPJ Data Science 2016 5:28","doi":"10.1140/epjds/s13688-016-0090-4","primary_category":"cs.DL","categories":"cs.DL|physics.soc-ph"} {"id":"1602.03202v1","submitted":"2016-02-07 04:57:17","updated":"2016-02-07 04:57:17","title":"Market Model and Optimal Pricing Scheme of Big Data and Internet of\n Things (IoT)","abstract":" Big data has been emerging as a new approach in utilizing large datasets to\noptimize complex system operations. Big data is fueled with Internet-of-Things\n(IoT) services that generate immense sensory data from numerous sensors and\ndevices. While most current research focus of big data is on machine learning\nand resource management design, the economic modeling and analysis have been\nlargely overlooked. This paper thus investigates the big data market model and\noptimal pricing scheme. 
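[Editorial illustration for the two TPOT records in this list (1601.07925 above and 1603.06212 below), which describe an open-source Python tool: a minimal usage sketch assuming the publicly released `tpot` package, with small search settings chosen only to keep the run short; the exact API may differ between versions.]

```python
# Hedged sketch: automated pipeline search with the open-source TPOT package
# (small generations/population only so the example finishes quickly).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tpot = TPOTClassifier(generations=3, population_size=20,
                      verbosity=2, random_state=0)
tpot.fit(X_train, y_train)                 # evolves preprocessing + model pipelines
print(tpot.score(X_test, y_test))          # accuracy of the best pipeline found
tpot.export("best_pipeline.py")            # emits the winning pipeline as Python code
```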
We first study the utility of data from the data\nscience perspective, i.e., using machine learning methods. We then\nintroduce the market model and develop an optimal pricing scheme afterward. The\ncase study shows clearly the suitability of the proposed data utility\nfunctions. The numerical examples demonstrate that the big data and IoT service\nprovider can achieve the maximum profit through the proposed market model.\n","authors":"Dusit Niyato|Mohammad Abu Alsheikh|Ping Wang|Dong In Kim|Zhu Han","affiliations":"","link_abstract":"http://arxiv.org/abs/1602.03202v1","link_pdf":"http://arxiv.org/pdf/1602.03202v1","link_doi":"http://dx.doi.org/10.1109/ICC.2016.7510922","comment":"","journal_ref":"","doi":"10.1109/ICC.2016.7510922","primary_category":"cs.GT","categories":"cs.GT"} {"id":"1602.05142v1","submitted":"2016-02-13 15:43:19","updated":"2016-02-13 15:43:19","title":"Data Science at Udemy: Agile Experimentation with Algorithms","abstract":" In this paper, we describe the data science framework at Udemy, which\ncurrently supports the recommender and search system. We explain the\nmotivations behind the framework and review the approach, which allows multiple\nindividual data scientists to all become 'full stack', taking control of their\nown destinies from the exploration and research phase, through algorithm\ndevelopment, experiment setup, and deep experiment analytics. We describe\nalgorithms tested and deployed in 2015, as well as some key insights obtained\nfrom experiments leading to the launch of the new recommender system at Udemy.\nFinally, we outline the current areas of research, which include search,\npersonalization, and algorithmic topic generation.\n","authors":"Larry Wai","affiliations":"","link_abstract":"http://arxiv.org/abs/1602.05142v1","link_pdf":"http://arxiv.org/pdf/1602.05142v1","link_doi":"","comment":"6 pages, submitted to KDD 2016","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|68U35|K.3.1"} {"id":"1602.07614v1","submitted":"2016-02-15 16:33:39","updated":"2016-02-15 16:33:39","title":"A Model of Selective Advantage for the Efficient Inference of Cancer\n Clonal Evolution","abstract":" Recently, there has been a resurgence of interest in rigorous algorithms for\nthe inference of cancer progression from genomic data. The motivations are\nmanifold: (i) growing NGS and single cell data from cancer patients, (ii) need\nfor novel Data Science and Machine Learning algorithms to infer models of\ncancer progression, and (iii) a desire to understand the temporal and\nheterogeneous structure of the tumor in order to tame its progression by efficacious\ntherapeutic intervention. This thesis presents a multi-disciplinary effort to\nmodel tumor progression involving the successive accumulation of genetic\nalterations, each resulting in populations that manifest themselves in a cancer\nphenotype. The framework presented in this work, along with the algorithms derived\nfrom it, represents a novel approach for inferring cancer progression, whose\naccuracy and convergence rates surpass those of existing techniques. The approach\nderives its power from several fields including algorithms in machine learning,\ntheory of causality and cancer biology. Furthermore, a modular pipeline to\nextract ensemble-level progression models from sequenced cancer genomes is\nproposed. The pipeline combines state-of-the-art techniques for sample\nstratification, driver selection, identification of fitness-equivalent\nexclusive alterations and progression model inference. 
Furthermore, the results\nare validated by synthetic data with realistic generative models, and\nempirically interpreted in the context of real cancer datasets; in the later\ncase, biologically significant conclusions are also highlighted. Specifically,\nit demonstrates the pipeline's ability to reproduce much of the knowledge on\ncolorectal cancer, as well as to suggest novel hypotheses. Lastly, it also\nproves that the proposed framework can be applied to reconstruct the\nevolutionary history of cancer clones in single patients, as illustrated by an\nexample from clear cell renal carcinomas.\n","authors":"Daniele Ramazzotti","affiliations":"","link_abstract":"http://arxiv.org/abs/1602.07614v1","link_pdf":"http://arxiv.org/pdf/1602.07614v1","link_doi":"","comment":"Doctoral thesis, University of Milan","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG"} {"id":"1602.06468v3","submitted":"2016-02-20 21:56:49","updated":"2016-06-24 01:28:23","title":"FLASH: Fast Bayesian Optimization for Data Analytic Pipelines","abstract":" Modern data science relies on data analytic pipelines to organize\ninterdependent computational steps. Such analytic pipelines often involve\ndifferent algorithms across multiple steps, each with its own hyperparameters.\nTo achieve the best performance, it is often critical to select optimal\nalgorithms and to set appropriate hyperparameters, which requires large\ncomputational efforts. Bayesian optimization provides a principled way for\nsearching optimal hyperparameters for a single algorithm. However, many\nchallenges remain in solving pipeline optimization problems with\nhigh-dimensional and highly conditional search space. In this work, we propose\nFast LineAr SearcH (FLASH), an efficient method for tuning analytic pipelines.\nFLASH is a two-layer Bayesian optimization framework, which firstly uses a\nparametric model to select promising algorithms, then computes a nonparametric\nmodel to fine-tune hyperparameters of the promising algorithms. FLASH also\nincludes an effective caching algorithm which can further accelerate the search\nprocess. Extensive experiments on a number of benchmark datasets have\ndemonstrated that FLASH significantly outperforms previous state-of-the-art\nmethods in both search speed and accuracy. Using 50% of the time budget, FLASH\nachieves up to 20% improvement on test error rate compared to the baselines.\nFLASH also yields state-of-the-art performance on a real-world application for\nhealthcare predictive modeling.\n","authors":"Yuyu Zhang|Mohammad Taha Bahadori|Hang Su|Jimeng Sun","affiliations":"","link_abstract":"http://arxiv.org/abs/1602.06468v3","link_pdf":"http://arxiv.org/pdf/1602.06468v3","link_doi":"","comment":"21 pages, KDD 2016","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG"} {"id":"1602.08021v1","submitted":"2016-02-25 18:14:40","updated":"2016-02-25 18:14:40","title":"Stochastic forward-backward and primal-dual approximation algorithms\n with application to online image restoration","abstract":" Stochastic approximation techniques have been used in various contexts in\ndata science. We propose a stochastic version of the forward-backward algorithm\nfor minimizing the sum of two convex functions, one of which is not necessarily\nsmooth. Our framework can handle stochastic approximations of the gradient of\nthe smooth function and allows for stochastic errors in the evaluation of the\nproximity operator of the nonsmooth function. 
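[Editorial illustration for the stochastic forward-backward record just above (1602.08021), whose abstract continues below: a generic stochastic proximal-gradient (forward-backward) loop for a lasso-type smooth-plus-nonsmooth objective. The synthetic data, fixed step size and batch size are assumptions, not taken from the paper.]

```python
# Hedged sketch: stochastic forward-backward for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
# Forward step: mini-batch gradient of the smooth term; backward step: l1 prox
# (soft-thresholding). Data, step size and batch size are assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 50
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:5] = 2.0
b = A @ x_true + 0.1 * rng.standard_normal(n)

lam, gamma, batch = 0.5, 1e-3, 32

def soft(v, t):
    """Proximity operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(d)
for it in range(2000):
    idx = rng.choice(n, size=batch, replace=False)
    grad = A[idx].T @ (A[idx] @ x - b) * (n / batch)   # stochastic gradient estimate
    x = soft(x - gamma * grad, gamma * lam)            # proximal (backward) step
print("coefficients above 0.1 in magnitude:", int((np.abs(x) > 0.1).sum()))
```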
The almost sure convergence of\nthe iterates generated by the algorithm to a minimizer is established under\nrelatively mild assumptions. We also propose a stochastic version of a popular\nprimal-dual proximal splitting algorithm, establish its convergence, and apply\nit to an online image restoration problem.\n","authors":"Patrick L. Combettes|Jean-Christophe Pesquet","affiliations":"","link_abstract":"http://arxiv.org/abs/1602.08021v1","link_pdf":"http://arxiv.org/pdf/1602.08021v1","link_doi":"","comment":"5 Figures","journal_ref":"","doi":"","primary_category":"math.OC","categories":"math.OC|90C25, 90C15, 94A08"} {"id":"1603.01103v1","submitted":"2016-03-02 13:18:32","updated":"2016-03-02 13:18:32","title":"Regularities and Discrepancies of Credit Default Swaps: a Data Science\n approach through Benford's Law","abstract":" In this paper, we examine whether Benford's law can be applied to monitor\ndaily changes in sovereign Credit Default Swaps (CDS) quotes, which are\nacknowledged to be complex systems of economic content. This test is of\nparamount importance since the CDS of a country proxies its health and\nprobability of default, being associated with insurance against the event of\nits default. We fit Benford's law to the daily changes in sovereign CDS\nspreads for 13 European countries - both inside and outside the European Union\nand European Monetary Union. Two different tenors for the sovereign CDS\ncontracts are considered: 5 yrs and 10 yrs - the former being the reference\nand most liquid one. The time period under investigation is 2008-2015, which\nincludes the period of distress caused by the European sovereign debt crisis.\nMoreover, (i) an analysis over relevant sub-periods is carried out, and (ii)\nseveral insights are also provided by tracking\nBenford's law over moving windows. The main test for checking the conformance\nto Benford's law is - as usual - the $\\chi^{2}$ test, whose values are\npresented and discussed for all cases. The analysis is further completed by\nelaborations based on Chebyshev's distance and the Kullback-Leibler\ndivergence. The results highlight differences by countries and tenors. In\nparticular, these results suggest that liquidity seems to be associated with\nhigher levels of distortion. Greece - representing a peculiar case - shows a\nvery different path with respect to the other European countries.\n","authors":"Marcel Ausloos|Rosella Castellano|Roy Cerqueti","affiliations":"","link_abstract":"http://arxiv.org/abs/1603.01103v1","link_pdf":"http://arxiv.org/pdf/1603.01103v1","link_doi":"http://dx.doi.org/10.1016/j.chaos.2016.03.002","comment":"20 pages, 6 tables, 1 figure, Chaos, Solitons and Fractals, 2016","journal_ref":"Chaos, Solitons & Fractals 90 (2016) 8-17","doi":"10.1016/j.chaos.2016.03.002","primary_category":"q-fin.ST","categories":"q-fin.ST|62-O7, 91G70"} {"id":"1603.02021v1","submitted":"2016-03-07 12:04:10","updated":"2016-03-07 12:04:10","title":"Machine Learning for Protein Function","abstract":" Systematic identification of protein function is a key problem in current\nbiology. Most traditional methods fail to identify functionally equivalent\nproteins if they lack similar sequences, structural data or extensive manual\nannotations. 
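[Editorial illustration for the Benford's-law record above (1603.01103): the core test compares the first-digit distribution of a series of daily CDS changes against Benford frequencies with a chi-squared statistic. The simulated spread changes below are placeholders for the 2008-2015 data used by the authors.]

```python
# Hedged sketch: first-digit Benford test on (simulated) daily CDS changes.
import numpy as np

rng = np.random.default_rng(3)
changes = rng.lognormal(mean=0.0, sigma=1.5, size=2000)   # stand-in for |daily changes|

first_digit = np.array([int(f"{x:.6e}"[0]) for x in changes])
observed = np.array([(first_digit == d).sum() for d in range(1, 10)])

benford = np.log10(1 + 1 / np.arange(1, 10))              # P(first digit = d), d = 1..9
expected = benford * len(changes)

chi2 = ((observed - expected) ** 2 / expected).sum()      # 8 degrees of freedom
print("chi-squared statistic:", round(chi2, 2))
```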
In this thesis, I focused on feature engineering and machine\nlearning methods for identifying diverse classes of proteins that share\nfunctional relatedness but little sequence or structural similarity, notably,\nNeuropeptide Precursors (NPPs).\n I aim to identify functional protein classes solely using unannotated protein\nprimary sequences from any organism. This thesis focuses on feature\nrepresentations of whole protein sequences, sequence derived engineered\nfeatures, their extraction, frameworks for their usage by machine learning (ML)\nmodels, and the application of ML models to biological tasks, focusing on high\nlevel protein functions. I implemented the ideas of feature engineering to\ndevelop a platform (called NeuroPID) that extracts meaningful features for\nclassification of overlooked NPPs. The platform allows mass discovery of new\nNPs and NPPs. It was expanded as a webserver.\n I expanded our approach towards other challenging protein classes. This is\nimplemented as a novel bioinformatics toolkit called ProFET (Protein Feature\nEngineering Toolkit). ProFET extracts hundreds of biophysical and sequence\nderived attributes, allowing the application of machine learning methods to\nproteins. ProFET was applied on many protein benchmark datasets with state of\nthe art performance. The success of ProFET applies to a wide range of\nhigh-level functions such as metagenomic analysis, subcellular localization,\nstructure and unique functional properties (e.g. thermophiles, nucleic acid\nbinding).\n These methods and frameworks represent a valuable resource for using ML and\ndata science methods on proteins.\n","authors":"Dan Ofer","affiliations":"","link_abstract":"http://arxiv.org/abs/1603.02021v1","link_pdf":"http://arxiv.org/pdf/1603.02021v1","link_doi":"","comment":"MsC Thesis","journal_ref":"","doi":"","primary_category":"q-bio.GN","categories":"q-bio.GN"} {"id":"1603.04225v1","submitted":"2016-03-14 12:08:37","updated":"2016-03-14 12:08:37","title":"Linguistic neighbourhoods: explaining cultural borders on Wikipedia\n through multilingual co-editing activity","abstract":" In this paper, we study the network of global interconnections between\nlanguage communities, based on shared co-editing interests of Wikipedia\neditors, and show that although English is discussed as a potential lingua\nfranca of the digital space, its domination disappears in the network of\nco-editing similarities, and instead local connections come to the forefront.\nOut of the hypotheses we explored, bilingualism, linguistic similarity of\nlanguages, and shared religion provide the best explanations for the similarity\nof interests between cultural communities. Population attraction and\ngeographical proximity are also significant, but much weaker factors bringing\ncommunities together. In addition, we present an approach that allows for\nextracting significant cultural borders from editing activity of Wikipedia\nusers, and comparing a set of hypotheses about the social mechanisms generating\nthese borders. Our study sheds light on how culture is reflected in the\ncollective process of archiving knowledge on Wikipedia, and demonstrates that\ncross-lingual interconnections on Wikipedia are not dominated by one powerful\nlanguage. 
Our findings also raise some important policy questions for the\nWikimedia Foundation.\n","authors":"Anna Samoilenko|Fariba karimi|Daniel Edler|Jérôme Kunegis|Markus Strohmaier","affiliations":"","link_abstract":"http://arxiv.org/abs/1603.04225v1","link_pdf":"http://arxiv.org/pdf/1603.04225v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-016-0070-8","comment":"20 pages, 5 figures, 3 tables Best poster award at the NetSciX'16 in\n Wroclaw, Poland","journal_ref":"EPJ Data Science 2016 5(9)","doi":"10.1140/epjds/s13688-016-0070-8","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.SI"} {"id":"1603.06212v1","submitted":"2016-03-20 13:32:27","updated":"2016-03-20 13:32:27","title":"Evaluation of a Tree-based Pipeline Optimization Tool for Automating\n Data Science","abstract":" As the field of data science continues to grow, there will be an\never-increasing demand for tools that make machine learning accessible to\nnon-experts. In this paper, we introduce the concept of tree-based pipeline\noptimization for automating one of the most tedious parts of machine\nlearning---pipeline design. We implement an open source Tree-based Pipeline\nOptimization Tool (TPOT) in Python and demonstrate its effectiveness on a\nseries of simulated and real-world benchmark data sets. In particular, we show\nthat TPOT can design machine learning pipelines that provide a significant\nimprovement over a basic machine learning analysis while requiring little to no\ninput nor prior knowledge from the user. We also address the tendency for TPOT\nto design overly complex pipelines by integrating Pareto optimization, which\nproduces compact pipelines without sacrificing classification accuracy. As\nsuch, this work represents an important step toward fully automating machine\nlearning pipeline design.\n","authors":"Randal S. Olson|Nathan Bartley|Ryan J. Urbanowicz|Jason H. Moore","affiliations":"","link_abstract":"http://arxiv.org/abs/1603.06212v1","link_pdf":"http://arxiv.org/pdf/1603.06212v1","link_doi":"","comment":"8 pages, 5 figures, preprint to appear in GECCO 2016, edits not yet\n made from reviewer comments","journal_ref":"","doi":"","primary_category":"cs.NE","categories":"cs.NE|cs.AI|cs.LG"} {"id":"1603.06828v2","submitted":"2016-03-22 15:24:41","updated":"2016-11-24 11:24:10","title":"Robust principal graphs for data approximation","abstract":" Revealing hidden geometry and topology in noisy data sets is a challenging\ntask. Elastic principal graph is a computationally efficient and flexible data\napproximator based on embedding a graph into the data space and minimizing the\nenergy functional penalizing the deviation of graph nodes both from data points\nand from pluri-harmonic configuration (generalization of linearity). The\nstructure of principal graph is learned from data by application of a\ntopological grammar which in the simplest case leads to the construction of\nprincipal curves or trees. In order to more efficiently cope with noise and\noutliers, here we suggest using a trimmed data approximation term to increase\nthe robustness of the method. The modification of the method that we suggest\ndoes not affect either computational efficiency or general convergence\nproperties of the original elastic graph method. The trimmed elastic energy\nfunctional remains a Lyapunov function for the optimization algorithm. 
On\nseveral examples of complex data distributions we demonstrate how the robust\nprincipal graphs learn the global data structure and show the advantage of\nusing the trimmed data approximation term for the construction of principal\ngraphs and other popular data approximators.\n","authors":"A. N. Gorban|E. M. Mirkes|A. Zinovyev","affiliations":"","link_abstract":"http://arxiv.org/abs/1603.06828v2","link_pdf":"http://arxiv.org/pdf/1603.06828v2","link_doi":"http://dx.doi.org/10.5445/KSP/1000058749/11","comment":"A talk given at ECDA2015 (European Conference on Data Analysis,\n September 2nd to 4th 2015, University of Essex, Colchester, UK), to be\n published in Archives of Data Science","journal_ref":"Archives of Data Science, Series A, Vol. 2, No. 1, 2017","doi":"10.5445/KSP/1000058749/11","primary_category":"cs.DS","categories":"cs.DS"} {"id":"1603.07839v1","submitted":"2016-03-25 08:02:41","updated":"2016-03-25 08:02:41","title":"Early Detection of Combustion Instabilities using Deep Convolutional\n Selective Autoencoders on Hi-speed Flame Video","abstract":" This paper proposes an end-to-end convolutional selective autoencoder\napproach for early detection of combustion instabilities using rapidly arriving\nflame image frames. The instabilities arising in combustion processes cause\nsignificant deterioration and safety issues in various human-engineered systems\nsuch as land- and air-based gas turbine engines. These instabilities manifest\nas self-sustaining, large-amplitude pressure oscillations and periodic shedding of\ncoherent vortex structures at varying spatial scales. However, such\ninstability is extremely difficult to detect before a combustion process\nbecomes completely unstable due to its sudden (bifurcation-type) nature. In\nthis context, an autoencoder is trained to selectively mask stable flame frames and\nlet unstable flame image frames through. In that process, the model learns to\nidentify and extract rich descriptive and explanatory flame shape features.\nWith such a training scheme, the selective autoencoder is shown to be able to\ndetect subtle instability features as a combustion process transitions\nfrom the stable to the unstable region. As a consequence, the deep learning tool-chain\ncan perform as an early detection framework for combustion instabilities that\nwill have a transformative impact on the safety and performance of modern\nengines.\n","authors":"Adedotun Akintayo|Kin Gwn Lore|Soumalya Sarkar|Soumik Sarkar","affiliations":"","link_abstract":"http://arxiv.org/abs/1603.07839v1","link_pdf":"http://arxiv.org/pdf/1603.07839v1","link_doi":"http://dx.doi.org/10.1145/1235","comment":"A 10 pages, 10 figures submission for Applied Data Science Track of\n KDD16","journal_ref":"","doi":"10.1145/1235","primary_category":"cs.CV","categories":"cs.CV|cs.LG|cs.NE"} {"id":"1603.08102v1","submitted":"2016-03-26 11:43:41","updated":"2016-03-26 11:43:41","title":"GENMR: Generalized Query Processing through Map Reduce In Cloud Database\n Management System","abstract":" Big Data, Cloud computing, Cloud Database Management techniques, Data Science\nand many more are the buzzwords that represent the future of the IT industry.\nOne thing common to all these new techniques is that they deal with data, not\njust data but Big Data. Users store many kinds of data in cloud\nrepositories, and Cloud Database Management Systems deal with such large sets of\ndata. 
For processing such gigantic amounts of data, traditional approaches are\nnot suitable because they cannot handle data of this size.\nTo handle it, various solutions have been developed, such as Hadoop, Map\nReduce programming, HIVE, PIG, etc. Map Reduce code provides both\nscalability and reliability. To date, however, users are accustomed to SQL- and Oracle-style\nqueries for dealing with data and are not familiar with Map Reduce code.\nIn this paper, a generalized model, GENMR, has been implemented, which takes\nqueries written in various RDBMS dialects such as SQL, ORACLE, DB2 and MYSQL and converts\nthem into Map Reduce code. A comparison has been carried out to evaluate the performance\nof GENMR against the latest techniques such as HIVE and PIG, and it has been concluded\nthat GENMR performs much better than both techniques. We\nalso introduce an optimization technique for mapper placement problems to\nenhance the effect of parallelism, which improves the performance of this\namalgam approach.\n","authors":"Shweta Malhotra|Mohammad Najmud Doja|Bashir Alam|Mansaf Alam","affiliations":"","link_abstract":"http://arxiv.org/abs/1603.08102v1","link_pdf":"http://arxiv.org/pdf/1603.08102v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC"} {"id":"1603.08242v1","submitted":"2016-03-27 18:08:52","updated":"2016-03-27 18:08:52","title":"The Marshall-Olkin extended generalized Gompertz distribution","abstract":" A new four-parameter model called the Marshall-Olkin extended generalized\nGompertz distribution is introduced. Its hazard rate function can be constant,\nincreasing, decreasing, upside-down bathtub or bathtub-shaped depending on its\nparameters. Some mathematical properties of this model such as expansion for\nthe density function, moments, moment generating function, quantile function,\nmean deviations, mean residual life, order statistics and R\\'enyi entropy are\nderived. The maximum likelihood technique is used to estimate the unknown model\nparameters and the observed information matrix is determined. The applicability\nof the proposed model is shown by means of a real data set.\n","authors":"Lazhar Benkhelifa","affiliations":"","link_abstract":"http://arxiv.org/abs/1603.08242v1","link_pdf":"http://arxiv.org/pdf/1603.08242v1","link_doi":"","comment":"","journal_ref":"Journal of Data Science 15 (2), 239-266 (2017)","doi":"","primary_category":"math.ST","categories":"math.ST|stat.TH"} {"id":"1604.02608v1","submitted":"2016-04-09 20:46:01","updated":"2016-04-09 20:46:01","title":"A Case for Data Commons: Towards Data Science as a Service","abstract":" As the amount of scientific data continues to grow at ever faster rates, the\nresearch community is increasingly in need of flexible computational\ninfrastructure that can support the entirety of the data science lifecycle,\nincluding long-term data storage, data exploration and discovery services, and\ncompute capabilities to support data analysis and re-analysis, as new data are\nadded and as scientific pipelines are refined. We describe our experience\ndeveloping data commons-- interoperable infrastructure that co-locates data,\nstorage, and compute with common analysis tools--and present several case\nstudies. Across these case studies, several common requirements emerge,\nincluding the need for persistent digital identifier and metadata services,\nAPIs, data portability, pay-for-compute capabilities, and data peering\nagreements between data commons. 
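[Editorial illustration for the GENMR record above (1603.08102): a toy, in-memory Python version of the map and reduce phases that a simple `SELECT dept, COUNT(*) ... GROUP BY dept` query would translate to. This is not GENMR's generated Hadoop code, and the sample rows are invented.]

```python
# Hedged sketch: the map/shuffle/reduce equivalent of
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept
# written in plain Python (invented rows; not GENMR's generated code).
from collections import defaultdict

employees = [{"name": "a", "dept": "sales"},
             {"name": "b", "dept": "hr"},
             {"name": "c", "dept": "sales"}]

def mapper(row):
    yield row["dept"], 1            # emit (key, 1) per record

def reducer(key, values):
    return key, sum(values)         # aggregate counts per key

shuffled = defaultdict(list)        # stand-in for the shuffle phase
for row in employees:
    for key, value in mapper(row):
        shuffled[key].append(value)

print([reducer(k, v) for k, v in shuffled.items()])   # [('sales', 2), ('hr', 1)]
```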
Though many challenges, including\nsustainability and developing appropriate standards remain, interoperable data\ncommons bring us one step closer to effective Data Science as Service for the\nscientific research community.\n","authors":"Robert L. Grossman|Allison Heath|Mark Murphy|Maria Patterson|Walt Wells","affiliations":"","link_abstract":"http://arxiv.org/abs/1604.02608v1","link_pdf":"http://arxiv.org/pdf/1604.02608v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.DC"} {"id":"1604.03160v1","submitted":"2016-04-11 21:42:47","updated":"2016-04-11 21:42:47","title":"Towards a Privacy Research Roadmap for the Computing Community","abstract":" Great advances in computing and communication technology are bringing many\nbenefits to society, with transformative changes and financial opportunities\nbeing created in health care, transportation, education, law enforcement,\nnational security, commerce, and social interactions. Many of these benefits,\nhowever, involve the use of sensitive personal data, and thereby raise concerns\nabout privacy. Failure to address these concerns can lead to a loss of trust in\nthe private and public institutions that handle personal data, and can stifle\nthe independent thought and expression that is needed for our democracy to\nflourish.\n This report, sponsored by the Computing Community Consortium (CCC), suggests\na roadmap for privacy research over the next decade, aimed at enabling society\nto appropriately control threats to privacy while enjoying the benefits of\ninformation technology and data science. We hope that it will be useful to the\nagencies of the Federal Networking and Information Technology Research and\nDevelopment (NITRD) Program as they develop a joint National Privacy Research\nStrategy over the coming months. The report synthesizes input drawn from the\nprivacy and computing communities submitted to both the CCC and NITRD, as well\nas past reports on the topic.\n","authors":"Lorrie Cranor|Tal Rabin|Vitaly Shmatikov|Salil Vadhan|Daniel Weitzner","affiliations":"","link_abstract":"http://arxiv.org/abs/1604.03160v1","link_pdf":"http://arxiv.org/pdf/1604.03160v1","link_doi":"","comment":"A Computing Community Consortium (CCC) white paper, 23 pages","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1604.04639v1","submitted":"2016-04-15 20:43:20","updated":"2016-04-15 20:43:20","title":"ModelWizard: Toward Interactive Model Construction","abstract":" Data scientists engage in model construction to discover machine learning\nmodels that well explain a dataset, in terms of predictiveness,\nunderstandability and generalization across domains. Questions such as \"what if\nwe model common cause Z\" and \"what if Y's dependence on X reverses\" inspire\nmany candidate models to consider and compare, yet current tools emphasize\nconstructing a final model all at once.\n To more naturally reflect exploration when debating numerous models, we\npropose an interactive model construction framework grounded in composable\noperations. Primitive operations capture core steps refining data and model\nthat, when verified, form an inductive basis to prove model validity. Derived,\ncomposite operations enable advanced model families, both generic and\nspecialized, abstracted away from low-level details.\n We prototype our envisioned framework in ModelWizard, a domain-specific\nlanguage embedded in F# to construct Tabular models. 
We enumerate language\ndesign and demonstrate its use through several applications, emphasizing how\nlanguage may facilitate creation of complex models. To future engineers\ndesigning data science languages and tools, we offer ModelWizard's design as a\nnew model construction paradigm, speeding discovery of our universe's\nstructure.\n","authors":"Dylan Hutchison","affiliations":"","link_abstract":"http://arxiv.org/abs/1604.04639v1","link_pdf":"http://arxiv.org/pdf/1604.04639v1","link_doi":"","comment":"Master's Thesis","journal_ref":"","doi":"","primary_category":"cs.PL","categories":"cs.PL|cs.LG"} {"id":"1604.05676v2","submitted":"2016-04-19 18:06:37","updated":"2016-06-16 15:27:46","title":"Scientific Computing, High-Performance Computing and Data Science in\n Higher Education","abstract":" We present an overview of current academic curricula for Scientific\nComputing, High-Performance Computing and Data Science. After a survey of\ncurrent academic and non-academic programs across the globe, we focus on\nCanadian programs and specifically on the education program of the SciNet HPC\nConsortium, using its detailed enrollment and course statistics for the past\nfour to five years. Not only do these data display a steady and rapid increase\nin the demand for research-computing instruction, they also show a clear shift\nfrom traditional (high performance) computing to data-oriented methods. It is\nargued that this growing demand warrants specialized research computing\ndegrees. The possible curricula of such degrees are described next, taking\nexisting programs as an example, and adding SciNet's experiences of student\ndesires as well as trends in advanced research computing.\n","authors":"Marcelo Ponce|Erik Spence|Daniel Gruner|Ramses van Zon","affiliations":"","link_abstract":"http://arxiv.org/abs/1604.05676v2","link_pdf":"http://arxiv.org/pdf/1604.05676v2","link_doi":"http://dx.doi.org/10.22369/issn.2153-4136/10/1/5","comment":"Updated discussion and title","journal_ref":"Journal of Computational Science Education vol 10 (2019)","doi":"10.22369/issn.2153-4136/10/1/5","primary_category":"cs.CY","categories":"cs.CY|cs.DC"} {"id":"1604.07397v1","submitted":"2016-04-25 18:26:51","updated":"2016-04-25 18:26:51","title":"Teaching Data Science","abstract":" We describe an introductory data science course, entitled Introduction to\nData Science, offered at the University of Illinois at Urbana-Champaign. The\ncourse introduced general programming concepts by using the Python programming\nlanguage with an emphasis on data preparation, processing, and presentation.\nThe course had no prerequisites, and students were not expected to have any\nprogramming experience. This introductory course was designed to cover a wide\nrange of topics, from the nature of data, to storage, to visualization, to\nprobability and statistical analysis, to cloud and high performance computing,\nwithout becoming overly focused on any one subject. We conclude this article\nwith a discussion of lessons learned and our plans to develop new data science\ncourses.\n","authors":"Robert J. Brunner|Edward J. 
Kim","affiliations":"","link_abstract":"http://arxiv.org/abs/1604.07397v1","link_pdf":"http://arxiv.org/pdf/1604.07397v1","link_doi":"","comment":"10 pages, 4 figures, International Conference on Computational\n Science (ICCS 2016)","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT|cs.CY|physics.ed-ph"} {"id":"1605.00085v1","submitted":"2016-04-30 09:30:14","updated":"2016-04-30 09:30:14","title":"Usage of Cloud Computing Simulators and Future Systems For Computational\n Research","abstract":" Cloud Computing is an Internet based computing, whereby shared resources,\nsoftware and information, are provided to computers and devices on demand, like\nthe electricity grid. Currently, IaaS (Infrastructure as a Service), PaaS\n(Platform as a Service) and SaaS (Software as a Service) are used as a business\nmodel for Cloud Computing. Nowadays, the adoption and deployment of Cloud\nComputing is increasing in various domains, forcing researchers to conduct\nresearch in the area of Cloud Computing globally. Setting up the research\nenvironment is critical for the researchers in the developing countries to\nevaluate the research outputs. Currently, modeling, simulation technology and\naccess of resources from various university data centers has become a useful\nand powerful tool in cloud computing research. Several cloud simulators have\nbeen specifically developed by various universities to carry out Cloud\nComputing research, including CloudSim, SPECI, Green Cloud and Future Systems\n(the Indiana University machines India, Bravo, Delta, Echo and Foxtrot)\nsupports leading edge data science research and a broad range of\ncomputing-enabled education as well as integration of ideas from cloud and HPC\nsystems. In this paper, the features, suitability, adaptability and the\nlearning curve of the existing Cloud Computing simulators and Future Systems\nare reviewed and analyzed.\n","authors":"Ramkumar Lakshminarayanan|Rajasekar Ramalingam","affiliations":"","link_abstract":"http://arxiv.org/abs/1605.00085v1","link_pdf":"http://arxiv.org/pdf/1605.00085v1","link_doi":"","comment":"ETRT-ICT 2016 - College of Applied Sciences, Salalah, Sultanate of\n Oman","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC"} {"id":"1605.03035v1","submitted":"2016-05-10 14:37:16","updated":"2016-05-10 14:37:16","title":"Context-Aware Adaptive Framework for e-Health Monitoring","abstract":" For improving e-health services, we propose a context-aware framework to\nmonitor the activities of daily living of dependent persons. We define a\nstrategy for generating long-term realistic scenarios and a framework\ncontaining an adaptive monitoring algorithm based on three approaches for\noptimizing resource usage. The used approaches provide a deep knowledge about\nthe person's context by considering: the person's profile, the activities and\nthe relationships between activities. 
We evaluate the performances of our\nframework and show its adaptability and significant reduction in network,\nenergy and processing usage over a traditional monitoring implementation.\n","authors":"Haider Mshali|Tayeb Lemlouma|Damien Magoni","affiliations":"","link_abstract":"http://arxiv.org/abs/1605.03035v1","link_pdf":"http://arxiv.org/pdf/1605.03035v1","link_doi":"http://dx.doi.org/10.1109/DSDIS.2015.13","comment":"8 pages, Proceedings of the 11th IEEE Global Communications\n Conference (merged with the IEEE DSDIS2015), Sydney, Australi,11-13 December,\n 2015","journal_ref":"IEEE International Conference on Data Science and Data Intensive\n Systems (2015) 276-283","doi":"10.1109/DSDIS.2015.13","primary_category":"cs.CY","categories":"cs.CY|cs.AI|cs.NI"} {"id":"1605.04148v1","submitted":"2016-05-13 12:17:18","updated":"2016-05-13 12:17:18","title":"How to compute the barycenter of a weighted graph","abstract":" Discrete structures like graphs make it possible to naturally and flexibly\nmodel complex phenomena. Since graphs that represent various types of\ninformation are increasingly available today, their analysis has become a\npopular subject of research. The graphs studied in the field of data science at\nthis time generally have a large number of nodes that are not fairly weighted\nand connected to each other, translating a structural specification of the\ndata. Yet, even an algorithm for locating the average position in graphs is\nlacking although this knowledge would be of primary interest for statistical or\nrepresentation problems. In this work, we develop a stochastic algorithm for\nfinding the Frechet mean of weighted undirected metric graphs. This method\nrelies on a noisy simulated annealing algorithm dealt with using\nhomogenization. We then illustrate our algorithm with two examples (subgraphs\nof a social network and of a collaboration and citation network).\n","authors":"Sébastien Gadat|Ioana Gavra|Laurent Risser","affiliations":"","link_abstract":"http://arxiv.org/abs/1605.04148v1","link_pdf":"http://arxiv.org/pdf/1605.04148v1","link_doi":"","comment":"5 figures, 2 tables","journal_ref":"","doi":"","primary_category":"math.PR","categories":"math.PR|math.OC|math.ST|stat.TH"} {"id":"1605.04983v1","submitted":"2016-05-16 23:18:37","updated":"2016-05-16 23:18:37","title":"Decomposition Methods for Nonlinear Optimization and Data Mining","abstract":" We focus on two central themes in this dissertation. The first one is on\ndecomposing polytopes and polynomials in ways that allow us to perform\nnonlinear optimization. We start off by explaining important results on\ndecomposing a polytope into special polyhedra. We use these decompositions and\ndevelop methods for computing a special class of integrals exactly. Namely, we\nare interested in computing the exact value of integrals of polynomial\nfunctions over convex polyhedra. We present prior work and new extensions of\nthe integration algorithms. Every integration method we present requires that\nthe polynomial has a special form. We explore two special polynomial\ndecomposition algorithms that are useful for integrating polynomial functions.\nBoth polynomial decompositions have strengths and weaknesses, and we experiment\nwith how to practically use them.\n After developing practical algorithms and efficient software tools for\nintegrating a polynomial over a polytope, we focus on the problem of maximizing\na polynomial function over the continuous domain of a polytope. 
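[Editorial illustration for the graph-barycenter record above (1605.04148): as a point of reference only, the sketch below computes the node-restricted Fréchet mean of a small weighted graph by exhaustive search with networkx. This baseline only scales to tiny graphs and is not the paper's noisy simulated-annealing algorithm; the example graph and node weights are invented.]

```python
# Hedged sketch: node-restricted Frechet mean of a small weighted graph,
# i.e. the node minimizing the weighted sum of squared shortest-path distances.
# Exhaustive baseline only; the paper uses noisy simulated annealing instead.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1.0), ("b", "c", 2.0),
                           ("c", "d", 1.0), ("a", "d", 4.0)])
node_weight = {"a": 0.4, "b": 0.1, "c": 0.3, "d": 0.2}   # invented probability weights

def frechet_cost(G, v, node_weight):
    dist = nx.shortest_path_length(G, source=v, weight="weight")
    return sum(w * dist[u] ** 2 for u, w in node_weight.items())

barycenter = min(G.nodes, key=lambda v: frechet_cost(G, v, node_weight))
print(barycenter)   # 'b' for this toy graph
```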
This\nmaximization problem is NP-hard, but we develop approximation methods that run\nin polynomial time when the dimension is fixed. Moreover, our algorithm for\napproximating the maximum of a polynomial over a polytope is related to\nintegrating the polynomial over the polytope. We show how the integration\nmethods can be used for optimization.\n The second central topic in this dissertation is on problems in data science.\nWe first consider a heuristic for mixed-integer linear optimization. We show\nhow many practical mixed-integer linear programs have a special substructure\ncontaining set partition constraints. We then describe a nice data structure for finding\nfeasible zero-one integer solutions to systems of set partition constraints.\nFinally, we end with an applied project using data science methods in medical\nresearch.\n","authors":"Brandon Dutra","affiliations":"","link_abstract":"http://arxiv.org/abs/1605.04983v1","link_pdf":"http://arxiv.org/pdf/1605.04983v1","link_doi":"","comment":"PhD Thesis of Brandon Dutra","journal_ref":"","doi":"","primary_category":"math.CO","categories":"math.CO"} {"id":"1605.05422v2","submitted":"2016-05-18 02:46:14","updated":"2016-05-24 06:38:18","title":"Optimization Beyond Prediction: Prescriptive Price Optimization","abstract":" This paper addresses a novel data science problem, prescriptive price\noptimization, which derives the optimal price strategy to maximize future\nprofit/revenue on the basis of massive predictive formulas produced by machine\nlearning. The prescriptive price optimization first builds sales forecast\nformulas of multiple products, on the basis of historical data, which reveal\ncomplex relationships between sales and prices, such as price elasticity of\ndemand and cannibalization. Then, it constructs a mathematical optimization\nproblem on the basis of those predictive formulas. We show that the\noptimization problem can be formulated as an instance of binary quadratic\nprogramming (BQP). Although BQP problems are NP-hard in general and\ncomputationally intractable, we propose a fast approximation algorithm using a\nsemi-definite programming (SDP) relaxation, which is closely related to the\nGoemans-Williamson Max-Cut approximation. Our experiments on simulation and\nreal retail datasets show that our prescriptive price optimization\nsimultaneously derives the optimal prices of tens/hundreds of products within\npractical computational time, potentially improving the gross profit of\nthose products by 8.2%.\n","authors":"Shinji Ito|Ryohei Fujimaki","affiliations":"","link_abstract":"http://arxiv.org/abs/1605.05422v2","link_pdf":"http://arxiv.org/pdf/1605.05422v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.OC","categories":"math.OC|cs.LG|stat.ML"} {"id":"1605.08846v1","submitted":"2016-05-28 04:57:13","updated":"2016-05-28 04:57:13","title":"A Human-Centered Approach to Data Privacy: Political Economy, Power,\n and Collective Data Subjects","abstract":" Researchers find weaknesses in current strategies for protecting privacy in\nlarge datasets. Many anonymized datasets are reidentifiable, and norms for\noffering data subjects notice and consent overemphasize individual\nresponsibility. 
Based on fieldwork with data managers in the City of Seattle, I\nidentify ways that these conventional approaches break down in practice.\nDrawing on work from theorists in sociocultural anthropology, I propose that a\nHuman Centered Data Science move beyond concepts like dataset identifiability\nand sensitivity toward a broader ontology of who is implicated by a dataset,\nand new ways of anticipating how data can be combined and used.\n","authors":"Meg Young","affiliations":"","link_abstract":"http://arxiv.org/abs/1605.08846v1","link_pdf":"http://arxiv.org/pdf/1605.08846v1","link_doi":"","comment":"This is a workshop paper accepted to the Human-Centered Data Science\n Workshop at the Computer Supported Collaborative Work Conference in 2016","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1606.04153v1","submitted":"2016-06-13 22:02:13","updated":"2016-06-13 22:02:13","title":"Universal temporal features of rankings in competitive sports and games","abstract":" Many complex phenomena, from the selection of traits in biological systems to\nhierarchy formation in social and economic entities, show signs of competition\nand heterogeneous performance in the temporal evolution of their components,\nwhich may eventually lead to stratified structures such as the wealth\ndistribution worldwide. However, it is still unclear whether the road to\nhierarchical complexity is determined by the particularities of each phenomena,\nor if there are universal mechanisms of stratification common to many systems.\nHuman sports and games, with their (varied but simplified) rules of competition\nand measures of performance, serve as an ideal test bed to look for universal\nfeatures of hierarchy formation. With this goal in mind, we analyse here the\nbehaviour of players and team rankings over time for several sports and games.\nEven though, for a given time, the distribution of performance ranks varies\nacross activities, we find statistical regularities in the dynamics of ranks.\nSpecifically the rank diversity, a measure of the number of elements occupying\na given rank over a length of time, has the same functional form in sports and\ngames as in languages, another system where competition is determined by the\nuse or disuse of grammatical structures. Our results support the notion that\nhierarchical phenomena may be driven by the same underlying mechanisms of rank\nformation, regardless of the nature of their components. Moreover, such\nregularities can in principle be used to predict lifetimes of rank occupancy,\nthus increasing our ability to forecast stratification in the presence of\ncompetition.\n","authors":"José A. Morales|Sergio Sánchez|Jorge Flores|Carlos Pineda|Carlos Gershenson|Germinal Cocho|Jerónimo Zizumbo|Gerardo Iñiguez","affiliations":"","link_abstract":"http://arxiv.org/abs/1606.04153v1","link_pdf":"http://arxiv.org/pdf/1606.04153v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-016-0096-y","comment":"","journal_ref":"EPJ Data Science 5:33 (2016)","doi":"10.1140/epjds/s13688-016-0096-y","primary_category":"physics.soc-ph","categories":"physics.soc-ph|nlin.AO|stat.AP"} {"id":"1606.04456v1","submitted":"2016-06-14 16:55:01","updated":"2016-06-14 16:55:01","title":"Towards Operator-less Data Centers Through Data-Driven, Predictive,\n Proactive Autonomics","abstract":" Continued reliance on human operators for managing data centers is a major\nimpediment for them from ever reaching extreme dimensions. 
Large computer\nsystems in general, and data centers in particular, will ultimately be managed\nusing predictive computational and executable models obtained through\ndata-science tools, and at that point, the intervention of humans will be\nlimited to setting high-level goals and policies rather than performing\nlow-level operations. Data-driven autonomics, where management and control are\nbased on holistic predictive models that are built and updated using live data,\nopens one possible path towards limiting the role of operators in data centers.\nIn this paper, we present a data-science study of a public Google dataset\ncollected in a 12K-node cluster with the goal of building and evaluating\npredictive models for node failures. Our results support the practicality of a\ndata-driven approach by showing the effectiveness of predictive models based on\ndata found in typical data center logs. We use BigQuery, the big data SQL\nplatform from the Google Cloud suite, to process massive amounts of data and\ngenerate a rich feature set characterizing node state over time. We describe\nhow an ensemble classifier can be built out of many Random Forest classifiers,\neach trained on these features, to predict if nodes will fail in a future\n24-hour window. Our evaluation reveals that if we limit false positive rates to\n5%, we can achieve true positive rates between 27% and 88% with precision\nvarying between 50% and 72%. This level of performance allows us to recover a\nlarge fraction of jobs' executions (by redirecting them to other nodes when a\nfailure of the present node is predicted) that would otherwise have been wasted\ndue to failures. [...]\n","authors":"Alina Sîrbu|Ozalp Babaoglu","affiliations":"","link_abstract":"http://arxiv.org/abs/1606.04456v1","link_pdf":"http://arxiv.org/pdf/1606.04456v1","link_doi":"http://dx.doi.org/10.1007/s10586-016-0564-y","comment":"","journal_ref":"Cluster Computing, Volume 19, Issue 2, pp 865-878, 2016","doi":"10.1007/s10586-016-0564-y","primary_category":"cs.DC","categories":"cs.DC"} {"id":"1606.06769v1","submitted":"2016-06-21 21:04:06","updated":"2016-06-21 21:04:06","title":"Network Analysis of Urban Traffic with Big Bus Data","abstract":" Urban traffic analysis is crucial for traffic forecasting systems, urban\nplanning and, more recently, various mobile and network applications. In this\npaper, we analyse urban traffic with network and statistical methods. Our\nanalysis is based on a big bus dataset containing 45 million bus arrival\nsamples in Helsinki. We mainly address the following questions: 1. How can we\nidentify the areas that cause most of the traffic in the city? 2. Why is there\nurban traffic? Is bus traffic a key cause of urban traffic? 3. How can we\nimprove urban traffic systems? To answer these questions, first, betweenness\nis used to identify the most important areas that cause most of the traffic.\nSecond, using statistical methods, we find that bus traffic is not an important\ncause of urban traffic. We differentiate between urban traffic and bus\ntraffic in a city. We use bus delay as an indicator of urban traffic,\nand the number of buses as an indicator of bus traffic. Third, we give\nour solutions for improving urban traffic via traffic simulation on road\nnetworks. We show that adding more buses during peak times and providing\nbetter bus schedules in hot areas like railway stations, metro stations,\nshopping malls, etc. 
will reduce the urban traffic.\n","authors":"Kai Zhao","affiliations":"","link_abstract":"http://arxiv.org/abs/1606.06769v1","link_pdf":"http://arxiv.org/pdf/1606.06769v1","link_doi":"","comment":"This technical report won the best hack award in Big Data Science\n Hackathon, Helsinki,2015","journal_ref":"","doi":"","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.SI"} {"id":"1606.07042v1","submitted":"2016-06-22 18:58:59","updated":"2016-06-22 18:58:59","title":"Incentivizing Evaluation via Limited Access to Ground Truth:\n Peer-Prediction Makes Things Worse","abstract":" In many settings, an effective way of evaluating objects of interest is to\ncollect evaluations from dispersed individuals and to aggregate these\nevaluations together. Some examples are categorizing online content and\nevaluating student assignments via peer grading. For this data science problem,\none challenge is to motivate participants to conduct such evaluations carefully\nand to report them honestly, particularly when doing so is costly. Existing\napproaches, notably peer-prediction mechanisms, can incentivize truth telling\nin equilibrium. However, they also give rise to equilibria in which agents do\nnot pay the costs required to evaluate accurately, and hence fail to elicit\nuseful information. We show that this problem is unavoidable whenever agents\nare able to coordinate using low-cost signals about the items being evaluated\n(e.g., text labels or pictures). We then consider ways of circumventing this\nproblem by comparing agents' reports to ground truth, which is available in\npractice when there exist trusted evaluators---such as teaching assistants in\nthe peer grading scenario---who can perform a limited number of unbiased (but\nnoisy) evaluations. Of course, when such ground truth is available, a simpler\napproach is also possible: rewarding each agent based on agreement with ground\ntruth with some probability, and unconditionally rewarding the agent otherwise.\nSurprisingly, we show that the simpler mechanism achieves stronger incentive\nguarantees given less access to ground truth than a large set of\npeer-prediction mechanisms.\n","authors":"Alice Gao|James R. Wright|Kevin Leyton-Brown","affiliations":"","link_abstract":"http://arxiv.org/abs/1606.07042v1","link_pdf":"http://arxiv.org/pdf/1606.07042v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.GT","categories":"cs.GT"} {"id":"1606.07781v1","submitted":"2016-06-24 18:30:35","updated":"2016-06-24 18:30:35","title":"Harnessing the Power of the Crowd to Increase Capacity for Data Science\n in the Social Sector","abstract":" We present three case studies of organizations using a data science\ncompetition to answer a pressing question. The first is in education where a\nnonprofit that creates smart school budgets wanted to automatically tag budget\nline items. The second is in public health, where a low-cost, nonprofit women's\nhealth care provider wanted to understand the effect of demographic and\nbehavioral questions on predicting which services a woman would need. The third\nand final example is in government innovation: using online restaurant reviews\nfrom Yelp, competitors built models to forecast which restaurants were most\nlikely to have hygiene violations when visited by health inspectors. 
Finally,\nwe reflect on the unique benefits of the open, public competition model.\n","authors":"Peter Bull|Isaac Slavitt|Greg Lipstein","affiliations":"","link_abstract":"http://arxiv.org/abs/1606.07781v1","link_pdf":"http://arxiv.org/pdf/1606.07781v1","link_doi":"","comment":"Presented at 2016 ICML Workshop on #Data4Good: Machine Learning in\n Social Good Applications, New York, NY","journal_ref":"","doi":"","primary_category":"cs.HC","categories":"cs.HC|cs.CY|cs.SI|stat.ML"} {"id":"1607.00378v1","submitted":"2016-07-01 19:02:36","updated":"2016-07-01 19:02:36","title":"Want Drugs? Use Python","abstract":" We describe how Python can be leveraged to streamline the curation, modelling\nand dissemination of drug discovery data as well as the development of\ninnovative, freely available tools for the related scientific community. We\nlook at various examples, such as chemistry toolkits, machine-learning\napplications and web frameworks and show how Python can glue it all together to\ncreate efficient data science pipelines.\n","authors":"Michał Nowotka|George Papadatos|Mark Davies|Nathan Dedman|Anne Hersey","affiliations":"","link_abstract":"http://arxiv.org/abs/1607.00378v1","link_pdf":"http://arxiv.org/pdf/1607.00378v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.OH","categories":"cs.OH"} {"id":"1607.00858v1","submitted":"2016-07-04 12:40:15","updated":"2016-07-04 12:40:15","title":"Embracing Data Science","abstract":" Statistics is running the risk of appearing irrelevant to today's\nundergraduate students. Today's undergraduate students are familiar with data\nscience projects and they judge statistics against what they have seen.\nStatistics, especially at the introductory level, should take inspiration from\ndata science so that the discipline is not seen as somehow lesser than data\nscience. This article provides a brief overview of data science, outlines ideas\nfor how introductory courses could take inspiration from data science, and\nprovides a reference to materials for developing stand-alone data science\ncourses.\n","authors":"Adam Loy","affiliations":"","link_abstract":"http://arxiv.org/abs/1607.00858v1","link_pdf":"http://arxiv.org/pdf/1607.00858v1","link_doi":"","comment":"9 pages, 1 figure","journal_ref":"The UMAP Journal 36 (2015) 285-292","doi":"","primary_category":"stat.OT","categories":"stat.OT|cs.GL"} {"id":"1607.04940v3","submitted":"2016-07-18 02:51:02","updated":"2016-12-05 00:43:03","title":"An optimization approach to locally-biased graph algorithms","abstract":" Locally-biased graph algorithms are algorithms that attempt to find local or\nsmall-scale structure in a large data graph. In some cases, this can be\naccomplished by adding some sort of locality constraint and calling a\ntraditional graph algorithm; but more interesting are locally-biased graph\nalgorithms that compute answers by running a procedure that does not even look\nat most of the input graph. This corresponds more closely to what practitioners\nfrom various data science domains do, but it does not correspond well with the\nway that algorithmic and statistical theory is typically formulated. Recent\nwork from several research communities has focused on developing locally-biased\ngraph algorithms that come with strong complementary algorithmic and\nstatistical theory and that are useful in practice in downstream data science\napplications. 
We provide a review and overview of this work, highlighting\ncommonalities between seemingly-different approaches, and highlighting\npromising directions for future work.\n","authors":"Kimon Fountoulakis|David Gleich|Michael Mahoney","affiliations":"","link_abstract":"http://arxiv.org/abs/1607.04940v3","link_pdf":"http://arxiv.org/pdf/1607.04940v3","link_doi":"","comment":"19 pages, 13 figures","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI|cs.DS"} {"id":"1607.05251v1","submitted":"2016-07-18 19:39:23","updated":"2016-07-18 19:39:23","title":"The Emergence of Gravitational Wave Science: 100 Years of Development of\n Mathematical Theory, Detectors, Numerical Algorithms, and Data Analysis Tools","abstract":" On September 14, 2015, the newly upgraded Laser Interferometer\nGravitational-wave Observatory (LIGO) recorded a loud gravitational-wave (GW)\nsignal, emitted a billion light-years away by a coalescing binary of two\nstellar-mass black holes. The detection was announced in February 2016, in time\nfor the hundredth anniversary of Einstein's prediction of GWs within the theory\nof general relativity (GR). The signal represents the first direct detection of\nGWs, the first observation of a black-hole binary, and the first test of GR in\nits strong-field, high-velocity, nonlinear regime. In the remainder of its\nfirst observing run, LIGO observed two more signals from black-hole binaries,\none moderately loud, another at the boundary of statistical significance. The\ndetections mark the end of a decades-long quest, and the beginning of GW\nastronomy: finally, we are able to probe the unseen, electromagnetically dark\nUniverse by listening to it. In this article, we present a short historical\noverview of GW science: this young discipline combines GR, arguably the\ncrowning achievement of classical physics, with record-setting, ultra-low-noise\nlaser interferometry, and with some of the most powerful developments in the\ntheory of differential geometry, partial differential equations,\nhigh-performance computation, numerical analysis, signal processing,\nstatistical inference, and data science. Our emphasis is on the synergy between\nthese disciplines, and how mathematics, broadly understood, has historically\nplayed, and continues to play, a crucial role in the development of GW science.\nWe focus on black holes, which are very pure mathematical solutions of\nEinstein's gravitational-field equations that are nevertheless realized in\nNature, and that provided the first observed signals.\n","authors":"Michael Holst|Olivier Sarbach|Manuel Tiglio|Michele Vallisneri","affiliations":"","link_abstract":"http://arxiv.org/abs/1607.05251v1","link_pdf":"http://arxiv.org/pdf/1607.05251v1","link_doi":"","comment":"41 pages, 5 figures. To appear in Bulletin of the American\n Mathematical Society","journal_ref":"","doi":"","primary_category":"gr-qc","categories":"gr-qc|physics.comp-ph|physics.data-an|physics.hist-ph|physics.ins-det|83, 35, 65, 53, 68. 85"} {"id":"1607.05869v1","submitted":"2016-07-20 09:03:17","updated":"2016-07-20 09:03:17","title":"Indebted households profiling: a knowledge discovery from database\n approach","abstract":" A major challenge in consumer credit risk portfolio management is to classify\nhouseholds according to their risk profile. 
In order to build such risk\nprofiles, it is necessary to employ an approach that analyses data\nsystematically in order to detect important relationships, interactions,\ndependencies and associations amongst the available continuous and categorical\nvariables together, and to accurately generate profiles of the most interesting\nhousehold segments according to their credit risk. The objective of this work\nis to employ a knowledge discovery from database process to identify groups of\nindebted households and describe their profiles using a database collected by\nthe Consumer Credit Counselling Service (CCCS) in the UK. Employing a framework\nthat allows the use of both categorical and continuous data together to\nfind hidden structures in unlabelled data, we established the ideal number\nof clusters and described these clusters in order to identify the\nhouseholds that exhibit a high propensity for excessive debt levels.\n","authors":"Rodrigo Scarpel|Alexandros Ladas|Uwe Aickelin","affiliations":"","link_abstract":"http://arxiv.org/abs/1607.05869v1","link_pdf":"http://arxiv.org/pdf/1607.05869v1","link_doi":"","comment":"Annals of Data Science, 2 (1), pp. 43-59, 2015","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI|cs.CY"} {"id":"1607.05895v1","submitted":"2016-07-20 10:01:31","updated":"2016-07-20 10:01:31","title":"Adaptive Data Communication Interface: A User-Centric Visual Data\n Interpretation Framework","abstract":" In this position paper, we present ideas about creating a next-generation\nframework towards an adaptive interface for data communication and\nvisualisation systems. Our objective is to develop a system that accepts large\ndata sets as inputs and provides user-centric, meaningful visual information to\nassist owners in making sense of their data collection. The proposed framework\ncomprises four stages: (i) the knowledge base compilation, where we search and\ncollect existing state-of-the-art visualisation techniques per domain and user\npreferences; (ii) the development of the learning and inference system, where\nwe apply artificial intelligence techniques to learn, predict and recommend new\ngraphic interpretations; (iii) results evaluation; and (iv) reinforcement and\nadaptation, where valid outputs are stored in our knowledge base and the system\nis iteratively tuned to address new demands. These stages, as well as our\noverall vision, limitations and possible challenges, are introduced in this\narticle. We also discuss further extensions of this framework for other\nknowledge discovery tasks.\n","authors":"Grazziela P. Figueredo|Christian Wagner|Jonathan M. Garibaldi|Uwe Aickelin","affiliations":"","link_abstract":"http://arxiv.org/abs/1607.05895v1","link_pdf":"http://arxiv.org/pdf/1607.05895v1","link_doi":"http://dx.doi.org/10.1109/Trustcom.2015.571","comment":"The 9th IEEE International Conference on Big Data Science and\n Engineering (IEEE BigDataSE-15), pp. 128 - 135, 2015","journal_ref":"","doi":"10.1109/Trustcom.2015.571","primary_category":"cs.HC","categories":"cs.HC"} {"id":"1607.06190v1","submitted":"2016-07-21 04:57:16","updated":"2016-07-21 04:57:16","title":"An ensemble of machine learning and anti-learning methods for predicting\n tumour patient survival rates","abstract":" This paper primarily addresses a dataset relating to cellular, chemical and\nphysical conditions of patients gathered at the time they are operated upon to\nremove colorectal tumours. 
This data provides a unique insight into the\nbiochemical and immunological status of patients at the point of tumour removal\nalong with information about tumour classification and post-operative survival.\nThe relationship between severity of tumour, based on TNM staging, and survival\nis still unclear for patients with TNM stage 2 and 3 tumours. We ask whether it\nis possible to predict survival rate more accurately using a selection of\nmachine learning techniques applied to subsets of data to gain a deeper\nunderstanding of the relationships between a patient's biochemical markers and\nsurvival. We use a range of feature selection and single classification\ntechniques to predict the 5 year survival rate of TNM stage 2 and 3 patients\nwhich initially produces less than ideal results. The performance of each model\nindividually is then compared with subsets of the data where agreement is\nreached for multiple models. This novel method of selective ensembling\ndemonstrates that significant improvements in model accuracy on an unseen test\nset can be achieved for patients where agreement between models is achieved.\nFinally we point at a possible method to identify whether a patients prognosis\ncan be accurately predicted or not.\n","authors":"Christopher Roadknight|Durga Suryanarayanan|Uwe Aickelin|John Scholefield|Lindy Durrant","affiliations":"","link_abstract":"http://arxiv.org/abs/1607.06190v1","link_pdf":"http://arxiv.org/pdf/1607.06190v1","link_doi":"http://dx.doi.org/10.1109/DSAA.2015.7344863","comment":"IEEE International Conference on Data Science and Advanced Analytics\n (IEEE DSAA'2015), pp. 1-8, 2015. arXiv admin note: text overlap with\n arXiv:1307.1599, arXiv:1409.0788","journal_ref":"","doi":"10.1109/DSAA.2015.7344863","primary_category":"cs.LG","categories":"cs.LG"} {"id":"1607.08329v3","submitted":"2016-07-28 06:40:30","updated":"2016-12-22 21:52:12","title":"Robust Contextual Outlier Detection: Where Context Meets Sparsity","abstract":" Outlier detection is a fundamental data science task with applications\nranging from data cleaning to network security. Given the fundamental nature of\nthe task, this has been the subject of much research. Recently, a new class of\noutlier detection algorithms has emerged, called {\\it contextual outlier\ndetection}, and has shown improved performance when studying anomalous behavior\nin a specific context. However, as we point out in this article, such\napproaches have limited applicability in situations where the context is sparse\n(i.e. lacking a suitable frame of reference). Moreover, approaches developed to\ndate do not scale to large datasets. To address these problems, here we propose\na novel and robust approach alternative to the state-of-the-art called RObust\nContextual Outlier Detection (ROCOD). We utilize a local and global behavioral\nmodel based on the relevant contexts, which is then integrated in a natural and\nrobust fashion. We also present several optimizations to improve the\nscalability of the approach. We run ROCOD on both synthetic and real-world\ndatasets and demonstrate that it outperforms other competitive baselines on the\naxes of efficacy and efficiency (40X speedup compared to modern contextual\noutlier detection methods). 
We also drill down and perform a fine-grained\nanalysis to shed light on the rationale for the performance gains of ROCOD and\nreveal its effectiveness when handling objects with sparse contexts.\n","authors":"Jiongqian Liang|Srinivasan Parthasarathy","affiliations":"","link_abstract":"http://arxiv.org/abs/1607.08329v3","link_pdf":"http://arxiv.org/pdf/1607.08329v3","link_doi":"","comment":"11 pages. Extended version of CIKM'16 paper","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB|cs.AI"} {"id":"1607.08878v1","submitted":"2016-07-29 18:06:39","updated":"2016-07-29 18:06:39","title":"Identifying and Harnessing the Building Blocks of Machine Learning\n Pipelines for Sensible Initialization of a Data Science Automation Tool","abstract":" As data science continues to grow in popularity, there will be an increasing\nneed to make data science tools more scalable, flexible, and accessible. In\nparticular, automated machine learning (AutoML) systems seek to automate the\nprocess of designing and optimizing machine learning pipelines. In this\nchapter, we present a genetic programming-based AutoML system called TPOT that\noptimizes a series of feature preprocessors and machine learning models with\nthe goal of maximizing classification accuracy on a supervised classification\nproblem. Further, we analyze a large database of pipelines that were previously\nused to solve various supervised classification problems and identify 100 short\nseries of machine learning operations that appear the most frequently, which we\ncall the building blocks of machine learning pipelines. We harness these\nbuilding blocks to initialize TPOT with promising solutions, and find that this\nsensible initialization method significantly improves TPOT's performance on one\nbenchmark at no cost of significantly degrading performance on the others.\nThus, sensible initialization with machine learning pipeline building blocks\nshows promise for GP-based AutoML systems, and should be further refined in\nfuture work.\n","authors":"Randal S. Olson|Jason H. Moore","affiliations":"","link_abstract":"http://arxiv.org/abs/1607.08878v1","link_pdf":"http://arxiv.org/pdf/1607.08878v1","link_doi":"","comment":"13 pages, 5 figures, preprint of chapter to appear in GPTP 2016 book","journal_ref":"","doi":"","primary_category":"cs.NE","categories":"cs.NE|cs.AI|cs.LG"} {"id":"1608.00451v3","submitted":"2016-08-01 14:46:02","updated":"2020-01-31 00:09:09","title":"Numerical tolerance for spectral decompositions of random matrices","abstract":" We precisely quantify the impact of statistical error in the quality of a\nnumerical approximation to a random matrix eigendecomposition, and under mild\nconditions, we use this to introduce an optimal numerical tolerance for\nresidual error in spectral decompositions of random matrices. We demonstrate\nthat terminating an eigendecomposition algorithm when the numerical error and\nstatistical error are of the same order results in computational savings with\nno loss of accuracy. We also repair a flaw in a ubiquitous termination\ncondition, one in wide employ in several computational linear algebra\nimplementations. We illustrate the practical consequences of our stopping\ncriterion with an analysis of simulated and real networks. 
Our theoretical\nresults and real-data examples establish that the tradeoff between statistical\nand numerical error is of significant import for data science.\n","authors":"Avanti Athreya|Michael Kane|Bryan Lewis|Zachary Lubberts|Vince Lyzinski|Youngser Park|Carey E. Priebe|Minh Tang","affiliations":"","link_abstract":"http://arxiv.org/abs/1608.00451v3","link_pdf":"http://arxiv.org/pdf/1608.00451v3","link_doi":"","comment":"20 pages, 2 figures","journal_ref":"","doi":"","primary_category":"stat.CO","categories":"stat.CO|cs.NA|math.NA|15, 62, 65"} {"id":"1608.05127v1","submitted":"2016-08-17 23:30:04","updated":"2016-08-17 23:30:04","title":"A Bayesian Network approach to County-Level Corn Yield Prediction using\n historical data and expert knowledge","abstract":" Crop yield forecasting is the methodology of predicting crop yields prior to\nharvest. The availability of accurate yield prediction frameworks have enormous\nimplications from multiple standpoints, including impact on the crop commodity\nfutures markets, formulation of agricultural policy, as well as crop insurance\nrating. The focus of this work is to construct a corn yield predictor at the\ncounty scale. Corn yield (forecasting) depends on a complex, interconnected set\nof variables that include economic, agricultural, management and meteorological\nfactors. Conventional forecasting is either knowledge-based computer programs\n(that simulate plant-weather-soil-management interactions) coupled with\ntargeted surveys or statistical model based. The former is limited by the need\nfor painstaking calibration, while the latter is limited to univariate analysis\nor similar simplifying assumptions that fail to capture the complex\ninterdependencies affecting yield. In this paper, we propose a data-driven\napproach that is \"gray box\" i.e. that seamlessly utilizes expert knowledge in\nconstructing a statistical network model for corn yield forecasting. Our\nmultivariate gray box model is developed on Bayesian network analysis to build\na Directed Acyclic Graph (DAG) between predictors and yield. Starting from a\ncomplete graph connecting various carefully chosen variables and yield, expert\nknowledge is used to prune or strengthen edges connecting variables.\nSubsequently the structure (connectivity and edge weights) of the DAG that\nmaximizes the likelihood of observing the training data is identified via\noptimization. We curated an extensive set of historical data (1948-2012) for\neach of the 99 counties in Iowa as data to train the model.\n","authors":"Vikas Chawla|Hsiang Sing Naik|Adedotun Akintayo|Dermot Hayes|Patrick Schnable|Baskar Ganapathysubramanian|Soumik Sarkar","affiliations":"","link_abstract":"http://arxiv.org/abs/1608.05127v1","link_pdf":"http://arxiv.org/pdf/1608.05127v1","link_doi":"","comment":"8 pages, In Proceedings of the 22nd ACM SIGKDD Workshop on Data\n Science for Food, Energy and Water , 2016 (San Francisco, CA, USA)","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.AP|stat.ML"} {"id":"1608.06419v4","submitted":"2016-08-23 08:42:50","updated":"2017-07-04 09:16:51","title":"An alternative approach to the limits of predictability in human\n mobility","abstract":" Next place prediction algorithms are invaluable tools, capable of increasing\nthe efficiency of a wide variety of tasks, ranging from reducing the spreading\nof diseases to better resource management in areas such as urban planning. 
In\nthis work we estimate upper and lower limits on the predictability of human\nmobility to help assess the performance of competing algorithms. We do this\nusing GPS traces from 604 individuals participating in a multi year long\nexperiment, The Copenhagen Networks study. Earlier works, focusing on the\nprediction of a participant's whereabouts in the next time bin, have found very\nhigh upper limits (>90%). We show that these upper limits are highly dependent\non the choice of a spatiotemporal scales and mostly reflect stationarity, i.e.\nthe fact that people tend to not move during small changes in time. This leads\nus to propose an alternative approach, which aims to predict the next location,\nrather than the location in the next bin. Our approach is independent of the\ntemporal scale and introduces a natural length scale. By removing the effects\nof stationarity we show that the predictability of the next location is\nsignificantly lower (~71%) than the predictability of the location in the next\nbin.\n","authors":"Edin Lind Ikanovic|Anders Mollgaard","affiliations":"","link_abstract":"http://arxiv.org/abs/1608.06419v4","link_pdf":"http://arxiv.org/pdf/1608.06419v4","link_doi":"","comment":"Changed \"of\" to \"in\" in the title, so that it matches the version in\n EPJ Data Science","journal_ref":"","doi":"","primary_category":"physics.soc-ph","categories":"physics.soc-ph"} {"id":"1608.07846v1","submitted":"2016-08-28 19:51:31","updated":"2016-08-28 19:51:31","title":"Data Analytics using Ontologies of Management Theories: Towards\n Implementing 'From Theory to Practice'","abstract":" We explore how computational ontologies can be impactful vis-a-vis the\ndeveloping discipline of \"data science.\" We posit an approach wherein\nmanagement theories are represented as formal axioms, and then applied to draw\ninferences about data that reside in corporate databases. That is, management\ntheories would be implemented as rules within a data analytics engine. We\ndemonstrate a case study development of such an ontology by formally\nrepresenting an accounting theory in First-Order Logic. Though quite\npreliminary, the idea that an information technology, namely ontologies, can\npotentially actualize the academic cliche, \"From Theory to Practice,\" and be\napplicable to the burgeoning domain of data analytics is novel and exciting.\n","authors":"Henry M. Kim|Jackie Ho Nam Cheung|Marek Laskowski|Iryna Gel","affiliations":"","link_abstract":"http://arxiv.org/abs/1608.07846v1","link_pdf":"http://arxiv.org/pdf/1608.07846v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI"} {"id":"1609.00464v2","submitted":"2016-09-02 04:26:54","updated":"2016-09-05 15:06:45","title":"The Semantic Knowledge Graph: A compact, auto-generated model for\n real-time traversal and ranking of any relationship within a domain","abstract":" This paper describes a new kind of knowledge representation and mining system\nwhich we are calling the Semantic Knowledge Graph. At its heart, the Semantic\nKnowledge Graph leverages an inverted index, along with a complementary\nuninverted index, to represent nodes (terms) and edges (the documents within\nintersecting postings lists for multiple terms/nodes). This provides a layer of\nindirection between each pair of nodes and their corresponding edge, enabling\nedges to materialize dynamically from underlying corpus statistics. 
As a\nresult, any combination of nodes can have edges to any other nodes materialize\nand be scored to reveal latent relationships between the nodes. This provides\nnumerous benefits: the knowledge graph can be built automatically from a\nreal-world corpus of data, new nodes - along with their combined edges - can be\ninstantly materialized from any arbitrary combination of preexisting nodes\n(using set operations), and a full model of the semantic relationships between\nall entities within a domain can be represented and dynamically traversed using\na highly compact representation of the graph. Such a system has widespread\napplications in areas as diverse as knowledge modeling and reasoning, natural\nlanguage processing, anomaly detection, data cleansing, semantic search,\nanalytics, data classification, root cause analysis, and recommendations\nsystems. The main contribution of this paper is the introduction of a novel\nsystem - the Semantic Knowledge Graph - which is able to dynamically discover\nand score interesting relationships between any arbitrary combination of\nentities (words, phrases, or extracted concepts) through dynamically\nmaterializing nodes and edges from a compact graphical representation built\nautomatically from a corpus of data representative of a knowledge domain.\n","authors":"Trey Grainger|Khalifeh AlJadda|Mohammed Korayem|Andries Smith","affiliations":"","link_abstract":"http://arxiv.org/abs/1609.00464v2","link_pdf":"http://arxiv.org/pdf/1609.00464v2","link_doi":"","comment":"Accepted for publication in 2016 IEEE 3rd International Conference on\n Data Science and Advanced Analytics","journal_ref":"","doi":"","primary_category":"cs.IR","categories":"cs.IR|cs.AI|cs.CL"} {"id":"1609.02079v2","submitted":"2016-09-07 17:16:50","updated":"2017-03-17 02:10:05","title":"Vertex coloring of graphs via phase dynamics of coupled oscillatory\n networks","abstract":" While Boolean logic has been the backbone of digital information processing,\nthere are classes of computationally hard problems wherein this conventional\nparadigm is fundamentally inefficient. Vertex coloring of graphs, belonging to\nthe class of combinatorial optimization represents such a problem; and is well\nstudied for its wide spectrum of applications in data sciences, life sciences,\nsocial sciences and engineering and technology. This motivates alternate, and\nmore efficient non-Boolean pathways to their solution. Here, we demonstrate a\ncoupled relaxation oscillator based dynamical system that exploits the\ninsulator-metal transition in vanadium dioxide (VO2), to efficiently solve the\nvertex coloring of graphs. By harnessing the natural analogue between\noptimization, pertinent to graph coloring solutions, and energy minimization\nprocesses in highly parallel, interconnected dynamical systems, we harness the\nphysical manifestation of the latter process to approximate the optimal\ncoloring of k-partite graphs. We further indicate a fundamental connection\nbetween the eigen properties of a linear dynamical system and the spectral\nalgorithms that can solve approximate graph coloring. 
Our work not only\nelucidates a physics-based computing approach but also presents tantalizing\nopportunities for building customized analog co-processors for solving hard\nproblems efficiently.\n","authors":"Abhinav Parihar|Nikhil Shukla|Matthew Jerry|Suman Datta|Arijit Raychowdhury","affiliations":"","link_abstract":"http://arxiv.org/abs/1609.02079v2","link_pdf":"http://arxiv.org/pdf/1609.02079v2","link_doi":"http://dx.doi.org/10.1038/s41598-017-00825-1","comment":"","journal_ref":"Scientific Reports 7 (2017) 911","doi":"10.1038/s41598-017-00825-1","primary_category":"cs.ET","categories":"cs.ET|cond-mat.other|math.DS"} {"id":"1609.02208v1","submitted":"2016-09-07 22:11:39","updated":"2016-09-07 22:11:39","title":"Breaking the Bandwidth Barrier: Geometrical Adaptive Entropy Estimation","abstract":" Estimators of information theoretic measures such as entropy and mutual\ninformation are a basic workhorse for many downstream applications in modern\ndata science. State of the art approaches have been either geometric (nearest\nneighbor (NN) based) or kernel based (with a globally chosen bandwidth). In\nthis paper, we combine both these approaches to design new estimators of\nentropy and mutual information that outperform state of the art methods. Our\nestimator uses local bandwidth choices of $k$-NN distances with a finite $k$,\nindependent of the sample size. Such a local and data dependent choice improves\nperformance in practice, but the bandwidth is vanishing at a fast rate, leading\nto a non-vanishing bias. We show that the asymptotic bias of the proposed\nestimator is universal; it is independent of the underlying distribution.\nHence, it can be pre-computed and subtracted from the estimate. As a byproduct,\nwe obtain a unified way of obtaining both kernel and NN estimators. The\ncorresponding theoretical contribution relating the asymptotic geometry of\nnearest neighbors to order statistics is of independent mathematical interest.\n","authors":"Weihao Gao|Sewoong Oh|Pramod Viswanath","affiliations":"","link_abstract":"http://arxiv.org/abs/1609.02208v1","link_pdf":"http://arxiv.org/pdf/1609.02208v1","link_doi":"","comment":"24 pages 8 figures","journal_ref":"","doi":"","primary_category":"cs.IT","categories":"cs.IT|cs.LG|math.IT|stat.ML"} {"id":"1609.02655v4","submitted":"2016-09-09 04:22:03","updated":"2019-07-23 22:12:15","title":"Singularity structures and impacts on parameter estimation in finite\n mixtures of distributions","abstract":" Singularities of a statistical model are the elements of the model's\nparameter space which make the corresponding Fisher information matrix\ndegenerate. These are the points for which estimation techniques such as the\nmaximum likelihood estimator and standard Bayesian procedures do not admit the\nroot-$n$ parametric rate of convergence. We propose a general framework for the\nidentification of singularity structures of the parameter space of finite\nmixtures, and study the impacts of the singularity structures on minimax lower\nbounds and rates of convergence for the maximum likelihood estimator over a\ncompact parameter space. Our study makes explicit the deep links between model\nsingularities, parameter estimation convergence rates and minimax lower bounds,\nand the algebraic geometry of the parameter space for mixtures of continuous\ndistributions. The theory is applied to establish concrete convergence rates of\nparameter estimation for finite mixture of skew-normal distributions. 
This rich\nand increasingly popular mixture model is shown to exhibit a remarkably complex\nrange of asymptotic behaviors which have not been hitherto reported in the\nliterature.\n","authors":"Nhat Ho|XuanLong Nguyen","affiliations":"","link_abstract":"http://arxiv.org/abs/1609.02655v4","link_pdf":"http://arxiv.org/pdf/1609.02655v4","link_doi":"","comment":"87 pages. This version has improved introduction and expanded\n discussion of related work. An abridged version is to appear on SIAM Journal\n on Mathematics of Data Science","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.ML|stat.TH"} {"id":"1609.03266v1","submitted":"2016-09-12 04:54:10","updated":"2016-09-12 04:54:10","title":"Recovering the History of Informed Consent for Data Science and Internet\n Industry Research Ethics","abstract":" Respect for persons is a cornerstone value for any conception of research\nethics--though how to best realize respect in practice is an ongoing question.\nIn the late 19th and early 20th centuries, \"informed consent\" emerged as a\nparticular way to operationalize respect in medical and behavioral research\ncontexts. Today, informed consent has been challenged by increasingly advanced\nnetworked information and communication technologies (ICTs) and the massive\namounts of data they produce--challenges that have led many researchers and\nprivate companies to abandon informed consent as untenable or infeasible\nonline.\n Against any easy dismissal, we aim to recover insights from the history of\ninformed consent as it developed from the late 19th century to today. With a\nparticular focus on the United States policy context, we show how informed\nconsent is not a fixed or monolithic concept that should be abandoned in view\nof new data-intensive and technological practices, but rather it is a mechanism\nthat has always been fluid--it has constantly evolved alongside the specific\ncontexts and practices it is intended to regulate. Building on this insight, we\narticulate some specific challenges and lessons from the history of informed\nconsent that stand to benefit current discussions of informed consent and\nresearch ethics in the context of data science and Internet industry research.\n","authors":"Elaine Sedenberg|Anna Lauren Hoffmann","affiliations":"","link_abstract":"http://arxiv.org/abs/1609.03266v1","link_pdf":"http://arxiv.org/pdf/1609.03266v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1609.04373v1","submitted":"2016-09-14 18:41:23","updated":"2016-09-14 18:41:23","title":"Consistency of Social Sensing Signatures Across Major US Cities","abstract":" Previous studies have shown that Twitter users have biases to tweet from\ncertain locations, locational bias, and during certain hours, temporal bias. We\nused three years of geolocated Twitter Data to quantify these biases and test\nour central hypothesis that Twitter users biases are consistent across US\ncities. Our results suggest that temporal and locational bias of Twitter users\nare inconsistent between three US metropolitan cities. 
We derive conclusions\nabout the role of the complexity of the underlying data producing process on\nits consistency and argue for the potential research avenue for Geospatial Data\nScience to test and quantify these inconsistencies in the class of organically\nevolved Big Data.\n","authors":"Aiman Soliman|Kiumars Soltani|Anand Padmanabhan|Shaowen Wang","affiliations":"","link_abstract":"http://arxiv.org/abs/1609.04373v1","link_pdf":"http://arxiv.org/pdf/1609.04373v1","link_doi":"","comment":"CyberGIS16","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI|physics.soc-ph"} {"id":"1609.05148v8","submitted":"2016-09-16 17:29:01","updated":"2018-12-06 07:02:50","title":"Discovering and Deciphering Relationships Across Disparate Data\n Modalities","abstract":" Understanding the relationships between different properties of data, such as\nwhether a connectome or genome has information about disease status, is\nbecoming increasingly important in modern biological datasets. While existing\napproaches can test whether two properties are related, they often require\nunfeasibly large sample sizes in real data scenarios, and do not provide any\ninsight into how or why the procedure reached its decision. Our approach,\n\"Multiscale Graph Correlation\" (MGC), is a dependence test that juxtaposes\npreviously disparate data science techniques, including k-nearest neighbors,\nkernel methods (such as support vector machines), and multiscale analysis (such\nas wavelets). Other methods typically require double or triple the number\nsamples to achieve the same statistical power as MGC in a benchmark suite\nincluding high-dimensional and nonlinear relationships - spanning polynomial\n(linear, quadratic, cubic), trigonometric (sinusoidal, circular, ellipsoidal,\nspiral), geometric (square, diamond, W-shape), and other functions, with\ndimensionality ranging from 1 to 1000. Moreover, MGC uniquely provides a simple\nand elegant characterization of the potentially complex latent geometry\nunderlying the relationship, providing insight while maintaining computational\nefficiency. In several real data applications, including brain imaging and\ncancer genetics, MGC is the only method that can both detect the presence of a\ndependency and provide specific guidance for the next experiment and/or\nanalysis to conduct.\n","authors":"Joshua T. Vogelstein|Eric Bridgeford|Qing Wang|Carey E. Priebe|Mauro Maggioni|Cencheng Shen","affiliations":"","link_abstract":"http://arxiv.org/abs/1609.05148v8","link_pdf":"http://arxiv.org/pdf/1609.05148v8","link_doi":"http://dx.doi.org/10.7554/eLife.41690","comment":"","journal_ref":"eLife 2019;8:e41690","doi":"10.7554/eLife.41690","primary_category":"stat.ML","categories":"stat.ML"} {"id":"1609.05359v1","submitted":"2016-09-17 16:27:38","updated":"2016-09-17 16:27:38","title":"A Knowledge Ecosystem for the Food, Energy, and Water System","abstract":" Food, energy, and water (FEW) are key resources to sustain human life and\neconomic growth. There is an increasing stress on these interconnected\nresources due to population growth, natural disasters, and human activities.\nNew research is necessary to foster more efficient, more secure, and safer use\nof FEW resources in the U.S. and globally. In this position paper, we present\nthe idea of a knowledge ecosystem for enabling the semantic data integration of\nheterogeneous datasets in the FEW system to promote knowledge discovery and\nsuperior decision making through semantic reasoning. 
Rich, diverse datasets\npublished by U.S. federal agencies will be utilized. Our knowledge ecosystem\nwill build on Semantic Web technologies and advances in statistical relational\nlearning to (a) represent, integrate, and harmonize diverse data sources and\n(b) perform ontology-based reasoning to discover actionable insights from FEW\ndatasets.\n","authors":"Praveen Rao|Anas Katib|Daniel E. Lopez Barron","affiliations":"","link_abstract":"http://arxiv.org/abs/1609.05359v1","link_pdf":"http://arxiv.org/pdf/1609.05359v1","link_doi":"","comment":"KDD 2016 Workshop on Data Science for Food, Energy and Water, Aug\n 13-17, 2016, San Francisco, CA","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1609.05835v1","submitted":"2016-09-19 17:27:56","updated":"2016-09-19 17:27:56","title":"Scope for Machine Learning in Digital Manufacturing","abstract":" This provocation paper provides an overview of the underlying optimisation\nproblem in the emerging field of Digital Manufacturing. Initially, this paper\ndiscusses how the notion of Digital Manufacturing is transforming from a term\ndescribing a suite of software tools for the integration of production and\ndesign functions towards a more general concept incorporating computerised\nmanufacturing and supply chain processes, as well as information collection and\nutilisation across the product life cycle. On this basis, we use the example of\none such manufacturing process, Additive Manufacturing, to identify an\nintegrated multi-objective optimisation problem underlying Digital\nManufacturing. Forming an opportunity for a concurrent application of data\nscience and optimisation, a set of challenges arising from this problem is\noutlined.\n","authors":"Martin Baumers|Ender Ozcan","affiliations":"","link_abstract":"http://arxiv.org/abs/1609.05835v1","link_pdf":"http://arxiv.org/pdf/1609.05835v1","link_doi":"","comment":"Royal Society Workshop on Realising the Benefits of Machine Learning\n in Manufacturing","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI|cs.CY"} {"id":"1609.05855v1","submitted":"2016-09-19 18:46:44","updated":"2016-09-19 18:46:44","title":"Switching ferromagnetic spins by an ultrafast laser pulse: Emergence of\n giant optical spin-orbit torque","abstract":" Faster magnetic recording technology is indispensable to massive data storage\nand big data sciences. {All-optical spin switching offers a possible solution},\nbut at present it is limited to a handful of expensive and complex rare-earth\nferrimagnets. The spin switching in more abundant ferromagnets may\nsignificantly expand the scope of all-optical spin switching. Here by studying\n40,000 ferromagnetic spins, we show that it is the optical spin-orbit torque\nthat determines the course of spin switching in both ferromagnets and\nferrimagnets. Spin switching occurs only if the effective spin angular momentum\nof each constituent in an alloy exceeds a critical value. Because of the strong\nexchange coupling, the spin switches much faster in ferromagnets than\nweakly-coupled ferrimagnets. This establishes a paradigm for all-optical spin\nswitching. The resultant magnetic field (65 T) is so big that it will\nsignificantly reduce high current in spintronics, thus representing the\nbeginning of photospintronics.\n","authors":"G. P. Zhang|Y. H. Bai|T. F. 
George","affiliations":"","link_abstract":"http://arxiv.org/abs/1609.05855v1","link_pdf":"http://arxiv.org/pdf/1609.05855v1","link_doi":"http://dx.doi.org/10.1209/0295-5075/115/57003","comment":"12 pages, 6 figures. Accepted to Europhysics Letters (2016). Extended\n version with the supplementary information. Contribution from Indiana State\n University, Europhysics Letters (2016)","journal_ref":"","doi":"10.1209/0295-5075/115/57003","primary_category":"cond-mat.mtrl-sci","categories":"cond-mat.mtrl-sci"} {"id":"1609.08550v1","submitted":"2016-09-27 17:50:41","updated":"2016-09-27 17:50:41","title":"Correct classification for big/smart/fast data machine learning","abstract":" Table (relational database) classification for big/smart/fast data\nmachine learning is one of the most important tasks of predictive analytics and\nof extracting valuable information from data. It is a core applied technique for\nwhat is now understood as data science and/or artificial intelligence. The widely\nused Decision Tree (Random Forest) classifiers, and the rarely used rule-based\nclassifiers such as PRISM and VFST, are empirical substitutes for the theoretically\ncorrect approach of Boolean function minimization. The development of algorithms for\nminimizing Boolean functions began long ago, with Edward Veitch's work in 1952.\nSince then, the wider scientific and industrial community has made great efforts\nto find feasible solutions to Boolean function minimization. In this paper we\npropose to consider table data classification from a mathematical point of view,\nas the minimization of Boolean functions. It is shown how the data representation\nmay be transformed into Boolean function form and how known algorithms can then be\napplied. For simplicity, a binary output function is used for development, which\nopens the door to extensions with multivalued outputs.\n","authors":"Sander Stepanov","affiliations":"","link_abstract":"http://arxiv.org/abs/1609.08550v1","link_pdf":"http://arxiv.org/pdf/1609.08550v1","link_doi":"","comment":"15 pages","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.IT|math.IT"} {"id":"1609.08776v1","submitted":"2016-09-28 05:41:16","updated":"2016-09-28 05:41:16","title":"Connecting Data Science and Qualitative Interview Insights through\n Sentiment Analysis to Assess Migrants' Emotion States Post-Settlement","abstract":" Large-scale survey research by social scientists offers general\nunderstandings of migrants' challenges and provides assessments of\npost-migration benchmarks like employment, obtention of educational\ncredentials, and home ownership. Minimal research, however, probes the realm of\nemotions or \"feeling states\" in migration and settlement processes, and it is\noften approached through closed-ended survey questions that superficially\nassess feeling states. The evaluation of emotions in migration and settlement\nhas been largely left to qualitative researchers using in-depth, interpretive\nmethods like semi-structured interviewing. This approach also has major\nlimitations, namely small sample sizes that capture limited geographic\ncontexts, heavy time burdens analyzing data, and limits to analytic consistency\ngiven the nuances of qualitative data coding. Information about migrant emotion\nstates, however, would be valuable to governments and NGOs to enable policy and\nprogram development tailored to migrant challenges and frustrations, and would\nthereby stimulate economic development through thriving migrant populations. 
In\nthis paper, we present an interdisciplinary pilot project that offers a way\nthrough the methodological impasse by subjecting exhaustive qualitative\ninterviews of migrants to sentiment analysis using the Python NLTK toolkit. We\npropose that data scientists can efficiently and accurately produce large-scale\nassessments of migrant feeling states through collaboration with social\nscientists.\n","authors":"Sarah Knudson|Srijita Sarkar|Abhik Ray","affiliations":"University of Saskatchewan|University of Saskatchewan|Washington State University","link_abstract":"http://arxiv.org/abs/1609.08776v1","link_pdf":"http://arxiv.org/pdf/1609.08776v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2016","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1609.09582v1","submitted":"2016-09-30 03:36:03","updated":"2016-09-30 03:36:03","title":"Digitizing Municipal Street Inspections Using Computer Vision","abstract":" \"People want an authority to tell them how to value things. But they chose\nthis authority not based on facts or results. They chose it because it seems\nauthoritative and familiar.\" - The Big Short\n The pavement condition index is one such a familiar measure used by many US\ncities to measure street quality and justify billions of dollars spent every\nyear on street repair. These billion-dollar decisions are based on evaluation\ncriteria that are subjective and not representative. In this paper, we build\nupon our initial submission to D4GX 2015 that approaches this problem of\ninformation asymmetry in municipal decision-making.\n We describe a process to identify street-defects using computer vision\ntechniques on data collected using the Street Quality Identification Device\n(SQUID). A User Interface to host a large quantity of image data towards\ndigitizing the street inspection process and enabling actionable intelligence\nfor a core public service is also described. This approach of combining device,\ndata and decision-making around street repair enables cities make targeted\ndecisions about street repair and could lead to an anticipatory response which\ncan result in significant cost savings. Lastly, we share lessons learnt from\nthe deployment of SQUID in the city of Syracuse, NY.\n","authors":"Varun Adibhatla|Shi Fan|Krystof Litomisky|Patrick Atwater","affiliations":"ARGO Labs|NYU Center for Data Science|ARGO Labs|ARGO Labs","link_abstract":"http://arxiv.org/abs/1609.09582v1","link_pdf":"http://arxiv.org/pdf/1609.09582v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2016","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.CV"} {"id":"1611.05788v1","submitted":"2016-09-30 03:49:16","updated":"2016-09-30 03:49:16","title":"Data Science in Service of Performing Arts: Applying Machine Learning to\n Predicting Audience Preferences","abstract":" Performing arts organizations aim to enrich their communities through the\narts. To do this, they strive to match their performance offerings to the taste\nof those communities. Success relies on understanding audience preference and\npredicting their behavior. Similar to most e-commerce or digital entertainment\nfirms, arts presenters need to recommend the right performance to the right\ncustomer at the right time. As part of the Michigan Data Science Team (MDST),\nwe partnered with the University Musical Society (UMS), a non-profit performing\narts presenter housed in the University of Michigan, Ann Arbor. 
We are\nproviding UMS with analysis and business intelligence, utilizing historical\nindividual-level sales data. We built a recommendation system based on\ncollaborative filtering, gaining insights into the artistic preferences of\ncustomers, along with the similarities between performances. To better\nunderstand audience behavior, we used statistical methods from customer-base\nanalysis. We characterized customer heterogeneity via segmentation, and we\nmodeled customer cohorts to understand and predict ticket purchasing patterns.\nFinally, we combined statistical modeling with natural language processing\n(NLP) to explore the impact of wording in program descriptions. These ongoing\nefforts provide a platform to launch targeted marketing campaigns, helping UMS\ncarry out its mission by allocating its resources more efficiently. Celebrating\nits 138th season, UMS is a 2014 recipient of the National Medal of Arts, and it\ncontinues to enrich communities by connecting world-renowned artists with\ndiverse audiences, especially students in their formative years. We aim to\ncontribute to that mission through data science and customer analytics.\n","authors":"Jacob Abernethy|Cyrus Anderson|Alex Chojnacki|Chengyu Dai|John Dryden|Eric Schwartz|Wenbo Shen|Jonathan Stroud|Laura Wendlandt|Sheng Yang|Daniel Zhang","affiliations":"University of Michigan|University of Michigan|University of Michigan|University of Michigan|University of Michigan|University of Michigan|University of Michigan|University of Michigan|University of Michigan|University of Michigan|University of Michigan","link_abstract":"http://arxiv.org/abs/1611.05788v1","link_pdf":"http://arxiv.org/pdf/1611.05788v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2016","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP|cs.DB|cs.LG"} {"id":"1609.09758v3","submitted":"2016-09-30 14:43:47","updated":"2016-10-07 17:51:17","title":"Harnessing the Potential of the American Community Survey: Delving into\n Methods of Data Delivery","abstract":" The American Community Survey (ACS) is the bedrock underpinning any analysis\nof the US population, urban areas included. The Census Bureau delivers the ACS\ndata in multiple formats, yet in each the raw data is difficult to export in\nbulk and difficult to sift through. We argue that Enigma's approach to the data\ndelivery, such as our raw data and metadata presentation, reflects the survey's\nlogical structure. It can be explored, interlinked, and searched; making it\neasier to retrieve the appropriate data applicable to a question at hand. We\nmake the use of data more liquid via curated tables and API access; even\nmetadata and notes from technical documentation are programmatically\naccessible. Additionally, we are working towards opening our scalable and\nreproducible ingestion process of ACS estimations. This paper details all of\nthe ways the Census Bureau currently makes the data available, the barriers\neach of these raise to applying this data in analysis and how our approach\novercomes them. 
Finally, this paper will address other recent innovations in\nmaking Census datasets more usable, the use cases suited to each and how they\nfit into the wider application of data science.\n","authors":"Eve Ahearn|Olga Ianiuk","affiliations":"Enigma Technologies Inc.|Enigma Technologies Inc.","link_abstract":"http://arxiv.org/abs/1609.09758v3","link_pdf":"http://arxiv.org/pdf/1609.09758v3","link_doi":"","comment":"Presented at the Data For Good Exchange 2016","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1610.00040v2","submitted":"2016-09-30 21:55:55","updated":"2017-01-12 20:38:20","title":"A Primer on Coordinate Descent Algorithms","abstract":" This monograph presents a class of algorithms called coordinate descent\nalgorithms for mathematicians, statisticians, and engineers outside the field\nof optimization. This particular class of algorithms has recently gained\npopularity due to their effectiveness in solving large-scale optimization\nproblems in machine learning, compressed sensing, image processing, and\ncomputational statistics. Coordinate descent algorithms solve optimization\nproblems by successively minimizing along each coordinate or coordinate\nhyperplane, which is ideal for parallelized and distributed computing. Avoiding\ndetailed technicalities and proofs, this monograph gives relevant theory and\nexamples for practitioners to effectively apply coordinate descent to modern\nproblems in data science and engineering.\n","authors":"Hao-Jun Michael Shi|Shenyinying Tu|Yangyang Xu|Wotao Yin","affiliations":"","link_abstract":"http://arxiv.org/abs/1610.00040v2","link_pdf":"http://arxiv.org/pdf/1610.00040v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.OC","categories":"math.OC|stat.ML"} {"id":"1610.00985v2","submitted":"2016-10-01 00:56:16","updated":"2018-06-01 21:04:53","title":"Key attributes of a modern statistical computing tool","abstract":" In the 1990s, statisticians began thinking in a principled way about how\ncomputation could better support the learning and doing of statistics. Since\nthen, the pace of software development has accelerated, advancements in\ncomputing and data science have moved the goalposts, and it is time to\nreassess. Software continues to be developed to help do and learn statistics,\nbut there is little critical evaluation of the resulting tools, and no accepted\nframework with which to critique them. This paper presents a set of attributes\nnecessary for a modern statistical computing tool. The framework was designed\nto be broadly applicable to both novice and expert users, with a particular\nfocus on making more supportive statistical computing environments. A modern\nstatistical computing tool should be accessible, provide easy entry, privilege\ndata as a first-order object, support exploratory and confirmatory analysis,\nallow for flexible plot creation, support randomization, be interactive,\ninclude inherent documentation, support narrative, publishing, and\nreproducibility, and be flexible to extensions. 
Ideally, all these attributes\ncould be incorporated into one tool, supporting users at all levels, but a more\nreasonable goal is for tools designed for novices and professionals to `reach\nacross the gap,' taking inspiration from each others' strengths.\n","authors":"Amelia McNamara","affiliations":"","link_abstract":"http://arxiv.org/abs/1610.00985v2","link_pdf":"http://arxiv.org/pdf/1610.00985v2","link_doi":"http://dx.doi.org/10.1080/00031305.2018.1482784","comment":"","journal_ref":"","doi":"10.1080/00031305.2018.1482784","primary_category":"stat.CO","categories":"stat.CO|cs.CY"} {"id":"1610.00890v5","submitted":"2016-10-04 08:03:30","updated":"2018-03-14 12:16:13","title":"The Embedded Homology of Hypergraphs and Applications","abstract":" Hypergraphs are mathematical models for many problems in data sciences. In\nrecent decades, the topological properties of hypergraphs have been studied and\nvarious kinds of (co)homologies have been constructed (cf. [3, 4, 12]). In this\npaper, generalising the usual homology of simplicial complexes, we define the\nembedded homology of hypergraphs as well as the persistent embedded homology of\nsequences of hypergraphs. As a generalisation of the Mayer-Vietoris sequence\nfor the homology of simplicial complexes, we give a Mayer-Vietoris sequence for\nthe embedded homology of hypergraphs. Moreover, as applications of the embedded\nhomology, we study acyclic hypergraphs and construct some indices for the data\nanalysis of hyper-networks.\n","authors":"Stephane Bressan|Jingyan Li|Shiquan Ren|Jie Wu","affiliations":"","link_abstract":"http://arxiv.org/abs/1610.00890v5","link_pdf":"http://arxiv.org/pdf/1610.00890v5","link_doi":"","comment":"20 pages","journal_ref":"","doi":"","primary_category":"math.AT","categories":"math.AT|Primary 55U10, 55U15, Secondary 68P05, 68P15"} {"id":"1610.00937v2","submitted":"2016-10-04 11:48:40","updated":"2016-10-11 15:14:51","title":"Sharpe portfolio using a cross-efficiency evaluation","abstract":" The Sharpe ratio is a way to compare the excess returns (over the risk free\nasset) of portfolios for each unit of volatility that is generated by a\nportfolio. In this paper we introduce a robust Sharpe ratio portfolio under the\nassumption that the risk free asset is unknown. We propose a robust portfolio\nthat maximizes the Sharpe ratio when the risk free asset is unknown, but is\nwithin a given interval. To compute the best Sharpe ratio portfolio, all the\nSharpe ratios for any risk free asset are considered and compared by using the\nso-called cross-efficiency evaluation. An explicit expression of the\nCross-Efficiency Sharpe ratio portfolio is presented when short selling is\nallowed.\n","authors":"Juan F. Monge|Mercedes Landete|José L. Ruiz","affiliations":"","link_abstract":"http://arxiv.org/abs/1610.00937v2","link_pdf":"http://arxiv.org/pdf/1610.00937v2","link_doi":"http://dx.doi.org/10.1007/978-3-030-43384-0","comment":"","journal_ref":"Data Science and Productivity Analytics. 
International Series in\n Operations Research & Management Science, 2020","doi":"10.1007/978-3-030-43384-0","primary_category":"q-fin.PM","categories":"q-fin.PM|91Bxx"} {"id":"1610.04276v1","submitted":"2016-10-13 22:06:46","updated":"2016-10-13 22:06:46","title":"Perspectives on Surgical Data Science","abstract":" The availability of large amounts of data together with advances in\nanalytical techniques afford an opportunity to address difficult challenges in\nensuring that healthcare is safe, effective, efficient, patient-centered,\nequitable, and timely. Surgical care and training stand to tremendously gain\nthrough surgical data science. Herein, we discuss a few perspectives on the\nscope and objectives for surgical data science.\n","authors":"S. Swaroop Vedula|Masaru Ishii|Gregory D. Hager","affiliations":"","link_abstract":"http://arxiv.org/abs/1610.04276v1","link_pdf":"http://arxiv.org/pdf/1610.04276v1","link_doi":"","comment":"Workshop on Surgical Data Science, Heidelberg, Germany, June 20, 2016","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1610.04752v1","submitted":"2016-10-15 16:08:22","updated":"2016-10-15 16:08:22","title":"Preserving the value of large scale data analytics over time through\n selective re-computation","abstract":" A pervasive problem in Data Science is that the knowledge generated by\npossibly expensive analytics processes is subject to decay over time, as the\ndata used to compute it drifts, the algorithms used in the processes are\nimproved, and the external knowledge embodied by reference datasets used in the\ncomputation evolves. Deciding when such knowledge outcomes should be refreshed,\nfollowing a sequence of data change events, requires problem-specific functions\nto quantify their value and its decay over time, as well as models for\nestimating the cost of their re-computation. What makes this problem\nchallenging is the ambition to develop a decision support system for informing\ndata analytics re-computation decisions over time, that is both generic and\ncustomisable. With the help of a case study from genomics, in this vision paper\nwe offer an initial formalisation of this problem, highlight research\nchallenges, and outline a possible approach based on the collection and\nanalysis of metadata from a history of past computations.\n","authors":"Paolo Missier|Jacek Cala|Maisha Rathi","affiliations":"","link_abstract":"http://arxiv.org/abs/1610.04752v1","link_pdf":"http://arxiv.org/pdf/1610.04752v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1610.04963v1","submitted":"2016-10-17 03:22:58","updated":"2016-10-17 03:22:58","title":"ProvDB: A System for Lifecycle Management of Collaborative Analysis\n Workflows","abstract":" As data-driven methods are becoming pervasive in a wide variety of\ndisciplines, there is an urgent need to develop scalable and sustainable tools\nto simplify the process of data science, to make it easier to keep track of the\nanalyses being performed and datasets being generated, and to enable\nintrospection of the workflows. In this paper, we describe our vision of a\nunified provenance and metadata management system to support lifecycle\nmanagement of complex collaborative data science workflows. 
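An aside on the ProvDB record above (arXiv:1610.04963): its abstract describes capturing the provenance of analysis steps and data artifacts so that workflows can be queried and debugged. The sketch below is only a generic toy illustration of that idea as a provenance graph built with networkx; the class name ProvGraph, the methods record_step and lineage, and the example file names are hypothetical and are not part of ProvDB.

```python
import networkx as nx

class ProvGraph:
    """Toy provenance store: artifacts and analysis steps are nodes,
    'used' and 'generated' relations are edges (illustration only)."""
    def __init__(self):
        self.g = nx.DiGraph()

    def record_step(self, step, inputs, outputs):
        # register one analysis step together with the artifacts it read and wrote
        self.g.add_node(step, kind="step")
        for a in inputs:
            self.g.add_node(a, kind="artifact")
            self.g.add_edge(a, step, relation="used")
        for a in outputs:
            self.g.add_node(a, kind="artifact")
            self.g.add_edge(step, a, relation="generated")

    def lineage(self, artifact):
        # everything upstream that the given artifact depends on
        return nx.ancestors(self.g, artifact)

prov = ProvGraph()
prov.record_step("clean.py", inputs=["raw.csv"], outputs=["clean.csv"])
prov.record_step("train.py", inputs=["clean.csv"], outputs=["model.pkl"])
print(prov.lineage("model.pkl"))  # {'raw.csv', 'clean.py', 'clean.csv', 'train.py'}
```

Querying lineage in this toy graph is the kind of bookkeeping and debugging task the record argues a real provenance system should support at scale.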
We argue that a\nlarge amount of information about the analysis processes and data artifacts\ncan, and should be, captured in a semi-passive manner; and we show that\nquerying and analyzing this information can not only simplify bookkeeping and\ndebugging tasks for data analysts but can also enable a rich new set of\ncapabilities like identifying flaws in the data science process itself. It can\nalso significantly reduce the time spent in fixing post-deployment problems\nthrough automated analysis and monitoring. We have implemented an initial\nprototype of our system, called ProvDB, on top of git (a version control\nsystem) and Neo4j (a graph database), and we describe its key features and\ncapabilities.\n","authors":"Hui Miao|Amit Chavan|Amol Deshpande","affiliations":"","link_abstract":"http://arxiv.org/abs/1610.04963v1","link_pdf":"http://arxiv.org/pdf/1610.04963v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1610.08098v2","submitted":"2016-10-25 21:13:55","updated":"2017-09-18 16:50:12","title":"The Effect of Pokémon Go on The Pulse of the City: A Natural\n Experiment","abstract":" Pok\\'emon Go, a location-based game that uses augmented reality techniques,\nreceived unprecedented media coverage due to claims that it allowed for greater\naccess to public spaces, increasing the number of people out on the streets,\nand generally improving health, social, and security indices. However, the true\nimpact of Pok\\'emon Go on people's mobility patterns in a city is still largely\nunknown. In this paper, we perform a natural experiment using data from mobile\nphone networks to evaluate the effect of Pok\\'emon Go on the pulse of a big\ncity: Santiago, capital of Chile. We found significant effects of the game on\nthe floating population of Santiago compared to movement prior to the game's\nrelease in August 2016: in the following week, up to 13.8\\% more people spent\ntime outside at certain times of the day, even if they do not seem to go out of\ntheir usual way. These effects were found by performing regressions using count\nmodels over the states of the cellphone network during each day under study.\nThe models used controlled for land use, daily patterns, and points of interest\nin the city.\n Our results indicate that, on business days, there are more people on the\nstreet at commuting times, meaning that people did not change their daily\nroutines but slightly adapted them to play the game. Conversely, on Saturday\nand Sunday night, people indeed went out to play, but favored places close to\nwhere they live.\n Even if the statistical effects of the game do not reflect the massive change\nin mobility behavior portrayed by the media, at least in terms of expanse, they\ndo show how \"the street\" may become a new place of leisure. This change should\nhave an impact on long-term infrastructure investment by city officials, and on\nthe drafting of public policies aimed at stimulating pedestrian traffic.\n","authors":"Eduardo Graells-Garrido|Leo Ferres|Diego Caro|Loreto Bravo","affiliations":"","link_abstract":"http://arxiv.org/abs/1610.08098v2","link_pdf":"http://arxiv.org/pdf/1610.08098v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-017-0119-3","comment":"23 pages, 7 figures. 
Published at EPJ Data Science","journal_ref":"","doi":"10.1140/epjds/s13688-017-0119-3","primary_category":"cs.SI","categories":"cs.SI"} {"id":"1610.08629v1","submitted":"2016-10-27 06:20:00","updated":"2016-10-27 06:20:00","title":"The Promise and Prejudice of Big Data in Intelligence Community","abstract":" Big data holds critical importance in the current generation of information\ntechnology, with applications ranging from financial, industrial, academic to\ndefense sectors. With the exponential rise of open source data from social\nmedia and increasing government monitoring, big data is now also linked with\nnational security, and subsequently to the intelligence community. In this\nstudy I review the scope of big data sciences in the functioning of\nintelligence community. The major part of my study focuses on the inherent\nlimitations of big data, which affects the intelligence agencies from gathering\nof information to anticipating surprises. The limiting factors range from\ntechnical to ethical issues connected with big data. My study concludes the\nneed of experts with domain knowledge from intelligence community to\nefficiently guide big data analysis for timely filling the knowledge gaps. As a\ncase study on limitations of using big data, I narrate some of the ongoing work\nin nuclear intelligence using simple analytics and argue on why big data\nanalysis in that case would lead to unnecessary complications. For further\ninvestigation, I highlight cases of crowdsource forecasting tournaments and\npredicting unrest from social media.\n","authors":"Karan Jani","affiliations":"","link_abstract":"http://arxiv.org/abs/1610.08629v1","link_pdf":"http://arxiv.org/pdf/1610.08629v1","link_doi":"","comment":"18 pages, 6 figures","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.CR"} {"id":"1611.00097v3","submitted":"2016-11-01 01:15:53","updated":"2018-04-23 19:00:58","title":"Scientific Literature Text Mining and the Case for Open Access","abstract":" \"Open access\" has become a central theme of journal reform in academic\npublishing. In this article, I examine the relationship between open access\npublishing and an important infrastructural element of a modern research\nenterprise, scientific literature text mining, or the use of data analytic\ntechniques to conduct meta-analyses and investigations into the scientific\ncorpus. I give a brief history of the open access movement, discuss novel\njournalistic practices, and an overview of data-driven investigation of the\nscientific corpus. I argue that particularly in an era where the veracity of\nmany research studies has been called into question, scientific literature text\nmining should be one of the key motivations for open access publishing, not\nonly in the basic sciences, but in the engineering and applied sciences as\nwell. The enormous benefits of unrestricted access to the research literature\nshould prompt scholars from all disciplines to lend their vocal support to\nenabling legal, wholesale access to the scientific literature as part of a data\nscience pipeline.\n","authors":"Gopal P. Sarma","affiliations":"","link_abstract":"http://arxiv.org/abs/1611.00097v3","link_pdf":"http://arxiv.org/pdf/1611.00097v3","link_doi":"http://dx.doi.org/10.21428/14888","comment":"5 pages","journal_ref":"Sarma G. Scientific Literature Text Mining and the Case for Open\n Access. The Journal of Open Engineering [Internet]. 
2017 Dec 8; Available\n from:\n https://www.tjoe.org/pub/scientific-literature-text-mining-and-the-case-for-open-access","doi":"10.21428/14888","primary_category":"cs.CY","categories":"cs.CY|cs.DL|physics.soc-ph"} {"id":"1611.01851v3","submitted":"2016-11-06 21:40:32","updated":"2017-10-18 07:26:11","title":"Bayesian Optimisation with Prior Reuse for Motion Planning in Robot\n Soccer","abstract":" We integrate learning and motion planning for soccer playing differential\ndrive robots using Bayesian optimisation. Trajectories generated using\nend-slope cubic Bezier splines are first optimised globally through Bayesian\noptimisation for a set of candidate points with obstacles. The optimised\ntrajectories along with robot and obstacle positions and velocities are stored\nin a database. The closest planning situation is identified from the database\nusing a k-Nearest Neighbour approach. It is further optimised online through\nreuse of prior information from the previously optimised trajectory. Our approach\nreduces the computation time of trajectory optimisation considerably. Velocity\nprofiling generates velocities consistent with robot kinodynamic constraints,\nand avoids collision and slipping. Extensive testing is done on the developed\nsimulator, as well as on physical differential drive robots. Our method shows\nmarked improvements in mitigating tracking error, and reducing traversal and\ncomputational time over competing techniques under the constraints of\nperforming tasks in real time.\n","authors":"Abhinav Agarwalla|Arnav Kumar Jain|KV Manohar|Arpit Saxena|Jayanta Mukhopadhyay","affiliations":"","link_abstract":"http://arxiv.org/abs/1611.01851v3","link_pdf":"http://arxiv.org/pdf/1611.01851v3","link_doi":"","comment":"Accepted at ACM India Joint Conference on Data Science and Management\n of Data 2018","journal_ref":"","doi":"","primary_category":"cs.RO","categories":"cs.RO"} {"id":"1611.02960v2","submitted":"2016-11-09 14:59:23","updated":"2016-11-28 16:36:44","title":"A Unified Maximum Likelihood Approach for Optimal Distribution Property\n Estimation","abstract":" The advent of data science has spurred interest in estimating properties of\ndistributions over large alphabets. Fundamental symmetric properties such as\nsupport size, support coverage, entropy, and proximity to uniformity, received\nmost attention, with each property estimated using a different technique and\noften intricate analysis tools.\n We prove that for all these properties, a single, simple, plug-in\nestimator---profile maximum likelihood (PML)---performs as well as the best\nspecialized techniques. This raises the possibility that PML may optimally\nestimate many other symmetric properties.\n","authors":"Jayadev Acharya|Hirakendu Das|Alon Orlitsky|Ananda Theertha Suresh","affiliations":"","link_abstract":"http://arxiv.org/abs/1611.02960v2","link_pdf":"http://arxiv.org/pdf/1611.02960v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.IT","categories":"cs.IT|cs.DS|cs.LG|math.IT"} {"id":"1611.06474v2","submitted":"2016-11-20 05:54:06","updated":"2017-08-22 08:27:52","title":"Nazr-CNN: Fine-Grained Classification of UAV Imagery for Damage\n Assessment","abstract":" We propose Nazr-CNN1, a deep learning pipeline for object detection and\nfine-grained classification in images acquired from Unmanned Aerial Vehicles\n(UAVs) for damage assessment and monitoring. Nazr-CNN consists of two\ncomponents. 
The function of the first component is to localize objects (e.g.\nhouses or infrastructure) in an image by carrying out a pixel-level\nclassification. In the second component, a hidden layer of a Convolutional\nNeural Network (CNN) is used to encode Fisher Vectors (FV) of the segments\ngenerated from the first component in order to help discriminate between\ndifferent levels of damage. To showcase our approach we use data from UAVs that\nwere deployed to assess the level of damage in the aftermath of a devastating\ncyclone that hit the island of Vanuatu in 2015. The collected images were\nlabeled by a crowdsourcing effort and the labeling categories consisted of\nfine-grained levels of damage to built structures. Since our data set is\nrelatively small, a pre-trained network for pixel-level classification and FV\nencoding was used. Nazr-CNN attains promising results both for object detection\nand damage assessment suggesting that the integrated pipeline is robust in the\nface of small data sets and labeling errors by annotators. While the focus of\nNazr-CNN is on assessment of UAV images in a post-disaster scenario, our\nsolution is general and can be applied in many diverse settings. We show one\nsuch case of transfer learning to assess the level of damage in aerial images\ncollected after a typhoon in the Philippines.\n","authors":"N. Attari|F. Ofli|M. Awad|J. Lucas|S. Chawla","affiliations":"","link_abstract":"http://arxiv.org/abs/1611.06474v2","link_pdf":"http://arxiv.org/pdf/1611.06474v2","link_doi":"","comment":"Accepted for publication in the 4th IEEE International Conference on\n Data Science and Advanced Analytics (DSAA) 2017","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1611.07478v3","submitted":"2016-11-22 19:30:28","updated":"2016-12-08 08:24:15","title":"An unexpected unity among methods for interpreting model predictions","abstract":" Understanding why a model made a certain prediction is crucial in many data\nscience fields. Interpretable predictions engender appropriate trust and\nprovide insight into how the model may be improved. However, with large modern\ndatasets the best accuracy is often achieved by complex models even experts\nstruggle to interpret, which creates a tension between accuracy and\ninterpretability. Recently, several methods have been proposed for interpreting\npredictions from complex models by estimating the importance of input features.\nHere, we present how a model-agnostic additive representation of the importance\nof input features unifies current methods. This representation is optimal, in\nthe sense that it is the only set of additive values that satisfies important\nproperties. We show how we can leverage these properties to create novel visual\nexplanations of model predictions. 
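An aside on the record above (arXiv:1611.07478), which argues that several interpretation methods share one model-agnostic additive representation of feature importance: the sketch below is only a generic illustration of such an additive attribution, computing exact Shapley values for a toy model by enumerating feature subsets. It is not the authors' method or code; the function shapley_values, the toy linear model, and the zero baseline are assumptions made here for demonstration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for one prediction of a small model.
    predict: callable on a feature vector; x: the instance; baseline: reference input."""
    n = len(x)

    def value(subset):
        # features in `subset` take their value from x, the rest from the baseline
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for s in combinations(others, k):
                s = frozenset(s)
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (value(s | {i}) - value(s))
    return phi

# toy linear model: attributions recover coefficient * (x - baseline)
model = lambda z: 2.0 * z[0] + 3.0 * z[1] - 1.0 * z[2]
print(shapley_values(model, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0]))  # approximately [2.0, 6.0, -3.0]
```

The attributions sum to predict(x) - predict(baseline), which illustrates the additive structure the abstract refers to.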
The thread of unity that this representation\nweaves through the literature indicates that there are common principles to be\nlearned about the interpretation of model predictions that apply in many\nscenarios.\n","authors":"Scott Lundberg|Su-In Lee","affiliations":"","link_abstract":"http://arxiv.org/abs/1611.07478v3","link_pdf":"http://arxiv.org/pdf/1611.07478v3","link_doi":"","comment":"Presented at NIPS 2016 Workshop on Interpretable Machine Learning in\n Complex Systems","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI"} {"id":"1611.07502v1","submitted":"2016-11-22 20:31:18","updated":"2016-11-22 20:31:18","title":"Component-based Synthesis of Table Consolidation and Transformation\n Tasks from Examples","abstract":" This paper presents an example-driven synthesis technique for automating a\nlarge class of data preparation tasks that arise in data science. Given a set\nof input tables and an output table, our approach synthesizes a table\ntransformation program that performs the desired task. Our approach is not\nrestricted to a fixed set of DSL constructs and can synthesize programs from an\narbitrary set of components, including higher-order combinators. At a\nhigh level, our approach performs type-directed enumerative search over partial\nprograms but incorporates two key innovations that allow it to scale: First,\nour technique can utilize any first-order specification of the components and\nuses SMT-based deduction to reject partial programs. Second, our algorithm uses\npartial evaluation to increase the power of deduction and drive enumerative\nsearch. We have evaluated our synthesis algorithm on dozens of data preparation\ntasks obtained from on-line forums, and we show that our approach can\nautomatically solve a large class of problems encountered by R users.\n","authors":"Yu Feng|Ruben Martins|Jacob Van Geffen|Isil Dillig|Swarat Chaudhuri","affiliations":"","link_abstract":"http://arxiv.org/abs/1611.07502v1","link_pdf":"http://arxiv.org/pdf/1611.07502v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.PL","categories":"cs.PL|I.2.2"} {"id":"1611.07509v1","submitted":"2016-11-22 20:50:47","updated":"2016-11-22 20:50:47","title":"A causal framework for discovering and removing direct and indirect\n discrimination","abstract":" Anti-discrimination is an increasingly important task in data science. In\nthis paper, we investigate the problem of discovering both direct and indirect\ndiscrimination from the historical data, and removing the discriminatory\neffects before the data is used for predictive analysis (e.g., building\nclassifiers). We make use of the causal network to capture the causal structure\nof the data. Then we model direct and indirect discrimination as the\npath-specific effects, which explicitly distinguish the two types of\ndiscrimination as the causal effects transmitted along different paths in the\nnetwork. Based on that, we propose an effective algorithm for discovering\ndirect and indirect discrimination, as well as an algorithm for precisely\nremoving both types of discrimination while retaining good data utility.\nDifferent from previous works, our approaches can ensure that the predictive\nmodels built from the modified data will not incur discrimination in decision\nmaking. 
Experiments using real datasets show the effectiveness of our\napproaches.\n","authors":"Lu Zhang|Yongkai Wu|Xintao Wu","affiliations":"University of Arkansas|University of Arkansas|University of Arkansas","link_abstract":"http://arxiv.org/abs/1611.07509v1","link_pdf":"http://arxiv.org/pdf/1611.07509v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG"} {"id":"1611.08331v1","submitted":"2016-11-25 00:53:37","updated":"2016-11-25 00:53:37","title":"An Overview on Data Representation Learning: From Traditional Feature\n Learning to Recent Deep Learning","abstract":" Since about 100 years ago, to learn the intrinsic structure of data, many\nrepresentation learning approaches have been proposed, including both linear\nones and nonlinear ones, supervised ones and unsupervised ones. Particularly,\ndeep architectures are widely applied for representation learning in recent\nyears, and have delivered top results in many tasks, such as image\nclassification, object detection and speech recognition. In this paper, we\nreview the development of data representation learning methods. Specifically,\nwe investigate both traditional feature learning algorithms and\nstate-of-the-art deep learning models. The history of data representation\nlearning is introduced, while available resources (e.g. online course, tutorial\nand book information) and toolboxes are provided. Finally, we conclude this\npaper with remarks and some interesting research directions on data\nrepresentation learning.\n","authors":"Guoqiang Zhong|Li-Na Wang|Junyu Dong","affiliations":"","link_abstract":"http://arxiv.org/abs/1611.08331v1","link_pdf":"http://arxiv.org/pdf/1611.08331v1","link_doi":"","comment":"About 20 pages. Submitted to Journal of Finance and Data Science as\n an invited paper","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML|68T05"} {"id":"1611.09874v1","submitted":"2016-11-29 21:03:45","updated":"2016-11-29 21:03:45","title":"On the radial profile of gas-phase Fe/α ratio around distant\n galaxies","abstract":" This paper presents a study of the chemical compositions in cool gas around a\nsample of 27 intermediate-redshift galaxies. The sample comprises 13 massive\nquiescent galaxies at z=0.40-0.73 probed by QSO sightlines at projected\ndistances d=3-400 kpc, and 14 star-forming galaxies at z=0.10-1.24 probed by\nQSO sightlines at d=8-163 kpc. The main goal of this study is to examine the\nradial profiles of the gas-phase Fe/{\\alpha} ratio in galaxy halos based on the\nobserved Fe II to Mg II column density ratios. Because Mg+ and Fe+ share\nsimilar ionization potentials, the relative ionization correction is small in\nmoderately ionized gas and the observed ionic abundance ratio N(Fe II)/N(Mg II)\nplaces a lower limit to the underlying (Fe/Mg) elemental abundance ratio. For\nquiescent galaxies, a median and dispersion of log <Fe/Mg>\n=-0.06+/-0.15 is found at d<~60 kpc, which declines to log <Fe/Mg>\n<-0.3 at d>~100 kpc. On the other hand, star-forming galaxies exhibit log <Fe/Mg> =-0.25+/-0.21 at d<~60 kpc and log <Fe/Mg>\n=-0.9+/-0.4 at larger distances. Including possible differential dust depletion\nor ionization correction would only increase the inferred (Fe/Mg) ratio. The\nobserved N(FeII)/N(Mg II) implies super-solar Fe/{\\alpha} ratios in the inner\nhalo of quiescent galaxies. 
An enhanced Fe abundance indicates a substantial\ncontribution by Type Ia supernovae in the chemical enrichment, which is at\nleast comparable to what is observed in the solar neighborhood or in\nintracluster media but differs from young star-forming regions. In the outer\nhalos of quiescent galaxies and in halos around star-forming galaxies, however,\nthe observed N(Fe II)/N(Mg II) is consistent with an {\\alpha}-element enhanced\nenrichment pattern, suggesting a core-collapse supernovae dominated enrichment\nhistory.\n","authors":"Fakhri S. Zahedy|Hsiao-Wen Chen|Jean-René Gauthier|Michael Rauch","affiliations":"U Chicago|U Chicago|Data Science, Inc|Carnegie Obs","link_abstract":"http://arxiv.org/abs/1611.09874v1","link_pdf":"http://arxiv.org/pdf/1611.09874v1","link_doi":"http://dx.doi.org/10.1093/mnras/stw3124","comment":"12 pages, 6 figures, accepted for publication in MNRAS","journal_ref":"","doi":"10.1093/mnras/stw3124","primary_category":"astro-ph.GA","categories":"astro-ph.GA"} {"id":"1611.09981v1","submitted":"2016-11-30 02:59:15","updated":"2016-11-30 02:59:15","title":"Decoding from Pooled Data: Sharp Information-Theoretic Bounds","abstract":" Consider a population consisting of n individuals, each of whom has one of d\ntypes (e.g. their blood type, in which case d=4). We are allowed to query this\ndatabase by specifying a subset of the population, and in response we observe a\nnoiseless histogram (a d-dimensional vector of counts) of types of the pooled\nindividuals. This measurement model arises in practical situations such as\npooling of genetic data and may also be motivated by privacy considerations. We\nare interested in the number of queries one needs to unambiguously determine\nthe type of each individual. In this paper, we study this information-theoretic\nquestion under the random, dense setting where in each query, a random subset\nof individuals of size proportional to n is chosen. This makes the problem a\nparticular example of a random constraint satisfaction problem (CSP) with a\n\"planted\" solution. We establish almost matching upper and lower bounds on the\nminimum number of queries m such that there is no solution other than the\nplanted one with probability tending to 1 as n tends to infinity. Our proof\nrelies on the computation of the exact \"annealed free energy\" of this model in\nthe thermodynamic limit, which corresponds to the exponential rate of decay of\nthe expected number of solution to this planted CSP. As a by-product of the\nanalysis, we show an identity of independent interest relating the Gaussian\nintegral over the space of Eulerian flows of a graph to its spanning tree\npolynomial.\n","authors":"Ahmed El Alaoui|Aaditya Ramdas|Florent Krzakala|Lenka Zdeborova|Michael I. 
Jordan","affiliations":"","link_abstract":"http://arxiv.org/abs/1611.09981v1","link_pdf":"http://arxiv.org/pdf/1611.09981v1","link_doi":"http://dx.doi.org/10.1137/18M1183339","comment":"","journal_ref":"SIAM Journal on Mathematics of Data Science 1-1 (2019), pp.\n 161-188","doi":"10.1137/18M1183339","primary_category":"math.PR","categories":"math.PR|cs.IT|math.IT"} {"id":"1612.01785v1","submitted":"2016-12-06 12:52:44","updated":"2016-12-06 12:52:44","title":"Estimating Local Commuting Patterns From Geolocated Twitter Data","abstract":" The emergence of large stores of transactional data generated by increasing\nuse of digital devices presents a huge opportunity for policymakers to improve\ntheir knowledge of the local environment and thus make more informed and better\ndecisions. A research frontier is hence emerging which involves exploring the\ntype of measures that can be drawn from data stores such as mobile phone logs,\nInternet searches and contributions to social media platforms, and the extent\nto which these measures are accurate reflections of the wider population. This\npaper contributes to this research frontier, by exploring the extent to which\nlocal commuting patterns can be estimated from data drawn from Twitter. It\nmakes three contributions in particular. First, it shows that simple heuristics\ndrawn from geolocated Twitter data offer a good proxy for local commuting\npatterns; one which outperforms the major existing method for estimating these\npatterns (the radiation model). Second, it investigates sources of error in the\nproxy measure, showing that the model performs better on short trips with\nhigher volumes of commuters; it also looks at demographic biases but finds\nthat, surprisingly, measurements are not significantly affected by the fact\nthat the demographic makeup of Twitter users differs significantly from the\npopulation as a whole. Finally, it looks at potential ways of going beyond\nsimple heuristics by incorporating temporal information into models.\n","authors":"Graham McNeill|Jonathan Bright|Scott A. Hale","affiliations":"","link_abstract":"http://arxiv.org/abs/1612.01785v1","link_pdf":"http://arxiv.org/pdf/1612.01785v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-017-0120-x","comment":"","journal_ref":"Graham McNeill, Jonathan Bright and Scott A Hale (2017) Estimating\n local commuting patterns from geolocated Twitter data, EPJ Data Science\n 20176:24","doi":"10.1140/epjds/s13688-017-0120-x","primary_category":"cs.CY","categories":"cs.CY|cs.SI|physics.soc-ph"} {"id":"1612.05222v4","submitted":"2016-12-15 20:20:11","updated":"2019-01-28 21:25:15","title":"Multivariate Submodular Optimization","abstract":" Submodular functions have found a wealth of new applications in data science\nand machine learning models in recent years. This has been coupled with many\nalgorithmic advances in the area of submodular optimization: (SO)\n$\\min/\\max~f(S): S \\in \\mathcal{F}$, where $\\mathcal{F}$ is a given family of\nfeasible sets over a ground set $V$ and $f:2^V \\rightarrow \\mathbb{R}$ is\nsubmodular. In this work we focus on a more general class of \\emph{multivariate\nsubmodular optimization} (MVSO) problems: $\\min/\\max~f (S_1,S_2,\\ldots,S_k):\nS_1 \\uplus S_2 \\uplus \\cdots \\uplus S_k \\in \\mathcal{F}$. 
Here we use $\\uplus$\nto denote disjoint union and hence this model is attractive where resources are\nbeing allocated across $k$ agents, who share a `joint' multivariate nonnegative\nobjective $f(S_1,S_2,\\ldots,S_k)$ that captures some type of submodularity\n(i.e. diminishing returns) property. We provide some explicit examples and\npotential applications for this new framework.\n For maximization, we show that practical algorithms such as accelerated\ngreedy variants and distributed algorithms achieve good approximation\nguarantees for very general families (such as matroids and $p$-systems). For\narbitrary families, we show that monotone (resp. nonmonotone) MVSO admits an\n$\\alpha (1-1/e)$ (resp. $\\alpha \\cdot 0.385$) approximation whenever monotone\n(resp. nonmonotone) SO admits an $\\alpha$-approximation over the multilinear\nformulation. This substantially expands the family of tractable models for\nsubmodular maximization. For minimization, we show that if SO admits a\n$\\beta$-approximation over \\emph{modular} functions, then MVSO admits a\n$\\frac{\\beta \\cdot n}{1+(n-1)(1-c)}$-approximation where $c\\in [0,1]$ denotes\nthe curvature of $f$, and this is essentially tight. Finally, we prove that\nMVSO has an $\\alpha k$-approximation whenever SO admits an\n$\\alpha$-approximation over the convex formulation.\n","authors":"Richard Santiago|F. Bruce Shepherd","affiliations":"","link_abstract":"http://arxiv.org/abs/1612.05222v4","link_pdf":"http://arxiv.org/pdf/1612.05222v4","link_doi":"","comment":"arXiv admin note: text overlap with arXiv:1803.03767","journal_ref":"Proceedings of the 36th International Conference on Machine\n Learning (ICML), PMLR 97:5599-5609, 2019","doi":"","primary_category":"cs.DS","categories":"cs.DS"} {"id":"1612.06427v3","submitted":"2016-12-19 21:45:36","updated":"2017-04-07 15:00:46","title":"An introduction to infinite HMMs for single molecule data analysis","abstract":" The hidden Markov model (HMM) has been a workhorse of single molecule data\nanalysis and is now commonly used as a standalone tool in time series analysis\nor in conjunction with other analyses methods such as tracking. Here we provide\na conceptual introduction to an important generalization of the HMM which is\npoised to have a deep impact across Biophysics: the infinite hidden Markov\nmodel (iHMM). As a modeling tool, iHMMs can analyze sequential data without a\npriori setting a specific number of states as required for the traditional\n(finite) HMM. While the current literature on the iHMM is primarily intended\nfor audiences in Statistics, the idea is powerful and the iHMM's breadth in\napplicability outside Machine Learning and Data Science warrants a careful\nexposition. Here we explain the key ideas underlying the iHMM with a special\nemphasis on implementation and provide a description of a code we are making\nfreely available. 
In a companion article, we provide an important extension of\nthe iHMM to accommodate complications such as drift.\n","authors":"Ioannis Sgouralis|Steve Presse","affiliations":"","link_abstract":"http://arxiv.org/abs/1612.06427v3","link_pdf":"http://arxiv.org/pdf/1612.06427v3","link_doi":"http://dx.doi.org/10.1016/j.bpj.2017.04.027","comment":"","journal_ref":"","doi":"10.1016/j.bpj.2017.04.027","primary_category":"physics.data-an","categories":"physics.data-an|physics.bio-ph|physics.chem-ph"} {"id":"1612.06433v3","submitted":"2016-12-19 21:55:19","updated":"2017-04-07 15:02:41","title":"ICON: an adaptation of infinite HMMs for time traces with drift","abstract":" Bayesian nonparametric methods have recently transformed emerging areas\nwithin data science. One such promising method, the infinite hidden Markov\nmodel (iHMM), generalizes the HMM which itself has become a workhorse in single\nmolecule data analysis. The iHMM goes beyond the HMM by self-consistently\nlearning all parameters learned by the HMM in addition to learning the number\nof states without recourse to any model selection steps. Despite its\ngenerality, simple features (such as drift), common to single molecule time\ntraces, result in an over-interpretation of drift and the introduction of\nartifact states. Here we present an adaptation of the iHMM that can treat data\nwith drift originating from one or many traces (e.g. FRET). Our fully Bayesian\nmethod couples the iHMM to a continuous control process (drift)\nself-consistently learned while learning all other quantities determined by the\niHMM (including state numbers). A key advantage of this method is that all\ntraces -regardless of drift or states visited across traces- may now be treated\non an equal footing thereby eliminating user-dependent trace selection (based\non drift levels), pre-processing to remove drift and post-processing model\nselection on state number.\n","authors":"Ioannis Sgouralis|Steve Presse","affiliations":"","link_abstract":"http://arxiv.org/abs/1612.06433v3","link_pdf":"http://arxiv.org/pdf/1612.06433v3","link_doi":"http://dx.doi.org/10.1016/j.bpj.2017.04.009","comment":"","journal_ref":"","doi":"10.1016/j.bpj.2017.04.009","primary_category":"physics.data-an","categories":"physics.data-an|physics.bio-ph"} {"id":"1612.06661v2","submitted":"2016-12-20 13:44:34","updated":"2017-11-04 21:37:30","title":"Four lectures on probabilistic methods for data science","abstract":" Methods of high-dimensional probability play a central role in applications\nfor statistics, signal processing theoretical computer science and related\nfields. These lectures present a sample of particularly useful tools of\nhigh-dimensional probability, focusing on the classical and matrix Bernstein's\ninequality and the uniform matrix deviation inequality. We illustrate these\ntools with applications for dimension reduction, network analysis, covariance\nestimation, matrix completion and sparse signal recovery. The lectures are\ngeared towards beginning graduate students who have taken a rigorous course in\nprobability but may not have any experience in data science applications.\n","authors":"Roman Vershynin","affiliations":"","link_abstract":"http://arxiv.org/abs/1612.06661v2","link_pdf":"http://arxiv.org/pdf/1612.06661v2","link_doi":"","comment":"Lectures given at 2016 PCMI Graduate Summer School in Mathematics of\n Data. 
Some typos, inaccuracies fixed","journal_ref":"","doi":"","primary_category":"math.PR","categories":"math.PR|cs.DS|cs.IT|math.IT|math.ST|stat.TH|60-01, 62-01, 65-01, 60B20, 65Cxx, 60E15, 62Fxx"} {"id":"1612.07140v2","submitted":"2016-12-21 14:32:35","updated":"2017-05-15 11:28:19","title":"A Guide to Teaching Data Science","abstract":" Demand for data science education is surging and traditional courses offered\nby statistics departments are not meeting the needs of those seeking training.\nThis has led to a number of opinion pieces advocating for an update to the\nStatistics curriculum. The unifying recommendation is computing should play a\nmore prominent role. We strongly agree with this recommendation, but advocate\nthe main priority is to bring applications to the forefront as proposed by\nNolan and Speed (1999). We also argue that the individuals tasked with\ndeveloping data science courses should not only have statistical training, but\nalso have experience analyzing data with the main objective of solving\nreal-world problems. Here, we share a set of general principles and offer a\ndetailed guide derived from our successful experience developing and teaching a\ngraduate-level, introductory data science course centered entirely on case\nstudies. We argue for the importance of statistical thinking, as defined by\nWild and Pfannkuck (1999) and describe how our approach teaches students three\nkey skills needed to succeed in data science, which we refer to as creating,\nconnecting, and computing. This guide can also be used for statisticians\nwanting to gain more practical knowledge about data science before embarking on\nteaching an introductory course.\n","authors":"Stephanie C. Hicks|Rafael A. Irizarry","affiliations":"","link_abstract":"http://arxiv.org/abs/1612.07140v2","link_pdf":"http://arxiv.org/pdf/1612.07140v2","link_doi":"","comment":"2 tables, 3 figures, 2 supplemental figures","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT|cs.CY"} {"id":"1612.07408v1","submitted":"2016-12-22 01:20:27","updated":"2016-12-22 01:20:27","title":"Statistical Distances and Their Role in Robustness","abstract":" Statistical distances, divergences, and similar quantities have a large\nhistory and play a fundamental role in statistics, machine learning and\nassociated scientific disciplines. However, within the statistical literature,\nthis extensive role has too often been played out behind the scenes, with other\naspects of the statistical problems being viewed as more central, more\ninteresting, or more important. The behind the scenes role of statistical\ndistances shows up in estimation, where we often use estimators based on\nminimizing a distance, explicitly or implicitly, but rarely studying how the\nproperties of a distance determine the properties of the estimators. Distances\nare also prominent in goodness-of-fit, but the usual question we ask is \"how\npowerful is this method against a set of interesting alternatives\" not \"what\naspect of the distance between the hypothetical model and the alternative are\nwe measuring?\"\n Our focus is on describing the statistical properties of some of the distance\nmeasures we have found to be most important and most visible. We illustrate the\nrobust nature of Neyman's chi-squared and the non-robust nature of Pearson's\nchi-squared statistics and discuss the concept of discretization robustness.\n","authors":"Marianthi Markatou|Yang Chen|Georgios Afendras|Bruce G. 
Lindsay","affiliations":"","link_abstract":"http://arxiv.org/abs/1612.07408v1","link_pdf":"http://arxiv.org/pdf/1612.07408v1","link_doi":"http://dx.doi.org/10.1007/978-3-319-69416-0_1","comment":"23 pages","journal_ref":"New Advances in Statistics and Data Science 2017, 3-26","doi":"10.1007/978-3-319-69416-0_1","primary_category":"math.ST","categories":"math.ST|stat.TH"} {"id":"1612.07670v1","submitted":"2016-12-22 16:02:25","updated":"2016-12-22 16:02:25","title":"The out-of-source error in multi-source cross validation-type procedures","abstract":" A scientific phenomenon under study may often be manifested by data arising\nfrom processes, i.e. sources, that may describe this phenomenon. In this contex\nof multi-source data, we define the \"out-of-source\" error, that is the error\ncommitted when a new observation of unknown source origin is allocated to one\nof the sources using a rule that is trained on the known labeled data. We\npresent an unbiased estimator of this error, and discuss its variance. We\nderive natural and easily verifiable assumptions under which the consistency of\nour estimator is guaranteed for a broad class of loss functions and data\ndistributions. Finally, we evaluate our theoretical results via a simulation\nstudy.\n","authors":"Georgios Afendras|Marianthi Markatou","affiliations":"","link_abstract":"http://arxiv.org/abs/1612.07670v1","link_pdf":"http://arxiv.org/pdf/1612.07670v1","link_doi":"http://dx.doi.org/10.1007/978-3-319-69416-0_2","comment":"16 pages, 4 tables","journal_ref":"New Advances in Statistics and Data Science 2017, 27-44","doi":"10.1007/978-3-319-69416-0_2","primary_category":"math.ST","categories":"math.ST|stat.TH"} {"id":"1612.08645v1","submitted":"2016-12-23 13:39:55","updated":"2016-12-23 13:39:55","title":"Assisting humans to achieve optimal sleep by changing ambient\n temperature","abstract":" Environment plays a vital role in the sleep mechanism of a human. It has been\nshown from many studies that sleeping and waking environment, waking time and\nhours of sleep is of very significant importance which can result in sleeping\ndisorders and variety of diseases. This paper finds the sleep cycle of an\nindividual and according changes the ambient temperature to maximize his/her\nsleep efficiency. We suggest a method which will assist in increasing sleep\nefficiency. 
Using the fast Fourier transform (FFT) of heart rate signals, we\nextract heart rate variability features such as the low frequency / high frequency\n(LF/HF) power ratio, detect sleep stages using an automated algorithm,\nand then apply a feedback mechanism to alter the ambient temperature depending\nupon the sleep stage.\n","authors":"Vivek Gupta|Siddhant Mittal|Sandip Bhaumik|Raj Roy","affiliations":"","link_abstract":"http://arxiv.org/abs/1612.08645v1","link_pdf":"http://arxiv.org/pdf/1612.08645v1","link_doi":"","comment":"Accepted, to appear in IEEE International Conference on\n Bioinformatics and Biomedicine (BIBM) 2016, 2016 International Workshop on\n Biomedical and Health Informatics (BHI 2016) & 2016 Workshop on Health\n Informatics and Data Science (HI-DS 2016)","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1612.08544v2","submitted":"2016-12-27 09:14:16","updated":"2017-11-13 17:42:12","title":"Theory-guided Data Science: A New Paradigm for Scientific Discovery from\n Data","abstract":" Data science models, although successful in a number of commercial domains,\nhave had limited applicability in scientific problems involving complex\nphysical phenomena. Theory-guided data science (TGDS) is an emerging paradigm\nthat aims to leverage the wealth of scientific knowledge for improving the\neffectiveness of data science models in enabling scientific discovery. The\noverarching vision of TGDS is to introduce scientific consistency as an\nessential component for learning generalizable models. Further, by producing\nscientifically interpretable models, TGDS aims to advance our scientific\nunderstanding by discovering novel domain insights. Indeed, the paradigm of\nTGDS has started to gain prominence in a number of scientific disciplines such\nas turbulence modeling, material discovery, quantum chemistry, bio-medical\nscience, bio-marker discovery, climate science, and hydrology. In this paper,\nwe formally conceptualize the paradigm of TGDS and present a taxonomy of\nresearch themes in TGDS. We describe several approaches for integrating domain\nknowledge in different research themes using illustrative examples from\ndifferent disciplines. We also highlight some of the promising avenues of novel\nresearch for realizing the full potential of theory-guided data science.\n","authors":"Anuj Karpatne|Gowtham Atluri|James Faghmous|Michael Steinbach|Arindam Banerjee|Auroop Ganguly|Shashi Shekhar|Nagiza Samatova|Vipin Kumar","affiliations":"","link_abstract":"http://arxiv.org/abs/1612.08544v2","link_pdf":"http://arxiv.org/pdf/1612.08544v2","link_doi":"http://dx.doi.org/10.1109/TKDE.2017.2720168","comment":"","journal_ref":"IEEE Transactions on Knowledge and Data Engineering, 29(10),\n pp.2318-2331. 2017","doi":"10.1109/TKDE.2017.2720168","primary_category":"cs.LG","categories":"cs.LG|cs.AI|stat.ML"} {"id":"1701.00705v1","submitted":"2016-12-29 20:40:42","updated":"2016-12-29 20:40:42","title":"Using Big Data to Enhance the Bosch Production Line Performance: A\n Kaggle Challenge","abstract":" This paper describes our approach to the Bosch production line performance\nchallenge run by Kaggle.com. Maximizing the production yield is at the heart of\nthe manufacturing industry. At the Bosch assembly line, data is recorded for\nproducts as they progress through each stage. Data science methods are applied\nto this huge data repository consisting of records of tests and measurements made\nfor each component along the assembly line to predict internal failures. 
We\nfound that it is possible to train a model that predicts which parts are most\nlikely to fail. Thus a smarter failure detection system can be built and the\nparts tagged likely to fail can be salvaged to decrease operating costs and\nincrease the profit margins.\n","authors":"Ankita Mangal|Nishant Kumar","affiliations":"","link_abstract":"http://arxiv.org/abs/1701.00705v1","link_pdf":"http://arxiv.org/pdf/1701.00705v1","link_doi":"http://dx.doi.org/10.1109/BigData.2016.7840826","comment":"IEEE Big Data 2016 Conference","journal_ref":"","doi":"10.1109/BigData.2016.7840826","primary_category":"cs.LG","categories":"cs.LG"} {"id":"1701.00752v1","submitted":"2017-01-03 17:29:11","updated":"2017-01-03 17:29:11","title":"Privacy-Preserving Data Analysis for the Federal Statistical Agencies","abstract":" Government statistical agencies collect enormously valuable data on the\nnation's population and business activities. Wide access to these data enables\nevidence-based policy making, supports new research that improves society,\nfacilitates training for students in data science, and provides resources for\nthe public to better understand and participate in their society. These data\nalso affect the private sector. For example, the Employment Situation in the\nUnited States, published by the Bureau of Labor Statistics, moves markets.\nNonetheless, government agencies are under increasing pressure to limit access\nto data because of a growing understanding of the threats to data privacy and\nconfidentiality.\n \"De-identification\" - stripping obvious identifiers like names, addresses,\nand identification numbers - has been found inadequate in the face of modern\ncomputational and informational resources.\n Unfortunately, the problem extends even to the release of aggregate data\nstatistics. This counter-intuitive phenomenon has come to be known as the\nFundamental Law of Information Recovery. It says that overly accurate estimates\nof too many statistics can completely destroy privacy. One may think of this as\ndeath by a thousand cuts. Every statistic computed from a data set leaks a\nsmall amount of information about each member of the data set - a tiny cut.\nThis is true even if the exact value of the statistic is distorted a bit in\norder to preserve privacy. But while each statistical release is an almost\nharmless little cut in terms of privacy risk for any individual, the cumulative\neffect can be to completely compromise the privacy of some individuals.\n","authors":"John Abowd|Lorenzo Alvisi|Cynthia Dwork|Sampath Kannan|Ashwin Machanavajjhala|Jerome Reiter","affiliations":"","link_abstract":"http://arxiv.org/abs/1701.00752v1","link_pdf":"http://arxiv.org/pdf/1701.00752v1","link_doi":"","comment":"A Computing Community Consortium (CCC) white paper, 7 pages","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.CR"} {"id":"1701.01775v1","submitted":"2017-01-06 22:48:57","updated":"2017-01-06 22:48:57","title":"From Sky to Earth: Data Science Methodology Transfer","abstract":" We describe here the parallels in astronomy and earth science datasets, their\nanalyses, and the opportunities for methodology transfer from astroinformatics\nto geoinformatics. Using example of hydrology, we emphasize how meta-data and\nontologies are crucial in such an undertaking. Using the infrastructure being\ndesigned for EarthCube - the Virtual Observatory for the earth sciences - we\ndiscuss essential steps for better transfer of tools and techniques in the\nfuture e.g. 
domain adaptation. Finally we point out that it is never a one-way\nprocess and there is enough for astroinformatics to learn from geoinformatics\nas well.\n","authors":"Ashish A. Mahabal|Daniel Crichton|S. G. Djorgovski|Emily Law|John S. Hughes","affiliations":"","link_abstract":"http://arxiv.org/abs/1701.01775v1","link_pdf":"http://arxiv.org/pdf/1701.01775v1","link_doi":"http://dx.doi.org/10.1017/S1743921317000060","comment":"10 pages, 5 figures, IAU Symposium 325, \"Astroinformatics\"","journal_ref":"","doi":"10.1017/S1743921317000060","primary_category":"astro-ph.IM","categories":"astro-ph.IM"} {"id":"1701.05386v1","submitted":"2017-01-19 12:19:34","updated":"2017-01-19 12:19:34","title":"Application of data science techniques to disentangle X-ray spectral\n variation of super-massive black holes","abstract":" We apply three data science techniques, Nonnegative Matrix Factorization\n(NMF), Principal Component Analysis (PCA) and Independent Component Analysis\n(ICA), to simulated X-ray energy spectra of a particular class of super-massive\nblack holes. Two competing physical models, one whose variable components are\nadditive and the other whose variable components are multiplicative, are known\nto successfully describe X-ray spectral variation of these super-massive black\nholes, within accuracy of the contemporary observation. We hope to utilize\nthese techniques to compare the viability of the models by probing the\nmathematical structure of the observed spectra, while comparing advantages and\ndisadvantages of each technique. We find that PCA is best to determine the\ndimensionality of a dataset, while NMF is better suited for interpreting\nspectral components and comparing them in terms of the physical models in\nquestion. ICA is able to reconstruct the parameters responsible for spectral\nvariation. In addition, we find that the results of these techniques are\nsufficiently different that applying them to observed data may be a useful test\nin comparing the accuracy of the two spectral models.\n","authors":"S. Pike|K. Ebisawa|S. Ikeda|M. Morii|M. Mizumoto|E. Kusunoki","affiliations":"JAXA/ISAS|JAXA/ISAS|Institute of Statistical Mathematics|Institute of Statistical Mathematics|Univ. of Tokyo|Univ. of Tokyo","link_abstract":"http://arxiv.org/abs/1701.05386v1","link_pdf":"http://arxiv.org/pdf/1701.05386v1","link_doi":"","comment":"Journal of Space Science Informatics Japan, volume 6, 2017 (JAXA-RR),\n accepted","journal_ref":"","doi":"","primary_category":"astro-ph.IM","categories":"astro-ph.IM|astro-ph.HE"} {"id":"1701.06953v1","submitted":"2017-01-19 13:53:28","updated":"2017-01-19 13:53:28","title":"Pore-geometry recognition: on the importance of quantifying similarity\n in nanoporous materials","abstract":" In most applications of nanoporous materials the pore structure is as\nimportant as the chemical composition as a determinant of performance. For\nexample, one can alter performance in applications like carbon capture or\nmethane storage by orders of magnitude by only modifying the pore structure\n(1,2). For these applications it is therefore important to identify the optimal\npore geometry and use this information to find similar materials. However, the\nmathematical language and tools to identify materials with similar pore\nstructures, but different composition, has been lacking. Here we develop a pore\nrecognition approach to quantify similarity of pore structures and classify\nthem using topological data analysis (3,4). 
Our approach allows us to identify\nmaterials with similar pore geometries, and to screen for materials that are\nsimilar to given top-performing structures. Using methane storage as a case\nstudy, we also show that materials can be divided into topologically distinct\nclasses -- and that each class requires different optimization strategies. In\nthis work we have focused on pore space, but our topological approach can be\ngeneralised to quantify similarity of any geometric object, which, given the\nmany different Materials Genomics initiatives (5,6), opens many interesting\navenues for big-data science.\n","authors":"Yongjin Lee|Senja D. Barthel|Paweł Dłotko|S. Mohamad Moosavi|Kathryn Hess|Berend Smit","affiliations":"","link_abstract":"http://arxiv.org/abs/1701.06953v1","link_pdf":"http://arxiv.org/pdf/1701.06953v1","link_doi":"","comment":"20 pages, 18 pages supplementary information, 13 Figures, 3 tables","journal_ref":"","doi":"","primary_category":"cond-mat.mtrl-sci","categories":"cond-mat.mtrl-sci|math.AT"} {"id":"1701.05632v1","submitted":"2017-01-19 22:35:46","updated":"2017-01-19 22:35:46","title":"The Internet as Quantitative Social Science Platform: Insights from a\n Trillion Observations","abstract":" With the large-scale penetration of the internet, for the first time,\nhumanity has become linked by a single, open, communications platform.\nHarnessing this fact, we report insights arising from a unified internet\nactivity and location dataset of an unparalleled scope and accuracy drawn from\nover a trillion (1.5$\\times 10^{12}$) observations of end-user internet\nconnections, with temporal resolution of just 15min over 2006-2012. We first\napply this dataset to the expansion of the internet itself over 1,647 urban\nagglomerations globally. We find that unique IP per capita counts reach\nsaturation at approximately one IP per three people, and take, on average, 16.1\nyears to achieve; eclipsing the estimated 100- and 60- year saturation times\nfor steam-power and electrification respectively. Next, we use intra-diurnal\ninternet activity features to up-scale traditional over-night sleep\nobservations, producing the first global estimate of over-night sleep duration\nin 645 cities over 7 years. We find statistically significant variation between\ncontinental, national and regional sleep durations including some evidence of\nglobal sleep duration convergence. Finally, we estimate the relationship\nbetween internet concentration and economic outcomes in 411 OECD regions and\nfind that the internet's expansion is associated with negative or positive\nproductivity gains, depending strongly on sectoral considerations. To our\nknowledge, our study is the first of its kind to use online/offline activity of\nthe entire internet to infer social science insights, demonstrating the\nunparalleled potential of the internet as a social data-science platform.\n","authors":"Klaus Ackermann|Simon D Angus|Paul A Raschky","affiliations":"","link_abstract":"http://arxiv.org/abs/1701.05632v1","link_pdf":"http://arxiv.org/pdf/1701.05632v1","link_doi":"","comment":"40 pages, including 4 main figures, and appendix","journal_ref":"","doi":"","primary_category":"q-fin.EC","categories":"q-fin.EC|cs.CY|cs.SI|physics.soc-ph|stat.ML"} {"id":"1701.06482v2","submitted":"2017-01-23 16:30:52","updated":"2017-01-31 10:50:05","title":"Surgical Data Science: Enabling Next-Generation Surgery","abstract":" This paper introduces Surgical Data Science as an emerging scientific\ndiscipline. 
Key perspectives are based on discussions during an intensive\ntwo-day international interactive workshop that brought together leading\nresearchers working in the related field of computer and robot assisted\ninterventions. Our consensus opinion is that increasing access to large amounts\nof complex data, at scale, throughout the patient care process, complemented by\nadvances in data science and machine learning techniques, has set the stage for\na new generation of analytics that will support decision-making and quality\nimprovement in interventional medicine. In this article, we provide a consensus\ndefinition for Surgical Data Science, identify associated challenges and\nopportunities and provide a roadmap for advancing the field.\n","authors":"Lena Maier-Hein|Swaroop Vedula|Stefanie Speidel|Nassir Navab|Ron Kikinis|Adrian Park|Matthias Eisenmann|Hubertus Feussner|Germain Forestier|Stamatia Giannarou|Makoto Hashizume|Darko Katic|Hannes Kenngott|Michael Kranzfelder|Anand Malpani|Keno März|Thomas Neumuth|Nicolas Padoy|Carla Pugh|Nicolai Schoch|Danail Stoyanov|Russell Taylor|Martin Wagner|Gregory D. Hager|Pierre Jannin","affiliations":"","link_abstract":"http://arxiv.org/abs/1701.06482v2","link_pdf":"http://arxiv.org/pdf/1701.06482v2","link_doi":"http://dx.doi.org/10.1038/s41551-017-0132-7","comment":"10 pages, 2 figures, White paper corresponding to\n http://www.surgical-data-science.org/workshop2016","journal_ref":"Nature Biomedical Engineering 2017","doi":"10.1038/s41551-017-0132-7","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1701.08716v2","submitted":"2017-01-25 17:07:33","updated":"2017-03-24 18:14:26","title":"Does Weather Matter? Causal Analysis of TV Logs","abstract":" Weather affects our mood and behaviors, and many aspects of our life. When it\nis sunny, most people become happier; but when it rains, some people get\ndepressed. Despite this evidence and the abundance of data, weather has mostly\nbeen overlooked in the machine learning and data science research. This work\npresents a causal analysis of how weather affects TV watching patterns. We show\nthat some weather attributes, such as pressure and precipitation, cause major\nchanges in TV watching patterns. To the best of our knowledge, this is the\nfirst large-scale causal study of the impact of weather on TV watching\npatterns.\n","authors":"Shi Zong|Branislav Kveton|Shlomo Berkovsky|Azin Ashkan|Nikos Vlassis|Zheng Wen","affiliations":"","link_abstract":"http://arxiv.org/abs/1701.08716v2","link_pdf":"http://arxiv.org/pdf/1701.08716v2","link_doi":"","comment":"Companion of the 26th International World Wide Web Conference","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.LG"} {"id":"1701.07535v2","submitted":"2017-01-26 01:20:36","updated":"2017-12-14 08:18:02","title":"Stratified Splitting for Efficient Monte Carlo Integration","abstract":" The efficient evaluation of high-dimensional integrals is of importance in\nboth theoretical and practical fields of science, such as data science,\nstatistical physics, and machine learning. However, due to the curse of\ndimensionality, exact and deterministic numerical methods are inefficient in\nhigh-dimensional settings. Consequently, for many practical problems, one\nmust resort to Monte Carlo estimation. In this paper, we introduce a novel\nSequential Monte Carlo technique called Stratified Splitting. 
The method\nprovides unbiased estimates and can handle various integrand types including\nindicator functions, which are used in rare-event probability estimation\nproblems. Moreover, we demonstrate that a variant of the algorithm can achieve\npolynomial complexity. The results of our numerical experiments suggest that\nthe Stratified Splitting method is capable of delivering accurate results for a\nvariety of integration problems.\n","authors":"Radislav Vaisman|Robert Salomone|Dirk P. Kroese","affiliations":"","link_abstract":"http://arxiv.org/abs/1701.07535v2","link_pdf":"http://arxiv.org/pdf/1701.07535v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.TH"} {"id":"1702.01522v4","submitted":"2017-02-06 07:53:10","updated":"2017-11-06 08:12:22","title":"Inverse statistical problems: from the inverse Ising problem to data\n science","abstract":" Inverse problems in statistical physics are motivated by the challenges of\n`big data' in different fields, in particular high-throughput experiments in\nbiology. In inverse problems, the usual procedure of statistical physics needs\nto be reversed: Instead of calculating observables on the basis of model\nparameters, we seek to infer parameters of a model based on observations. In\nthis review, we focus on the inverse Ising problem and closely related\nproblems, namely how to infer the coupling strengths between spins given\nobserved spin correlations, magnetisations, or other data. We review\napplications of the inverse Ising problem, including the reconstruction of\nneural connections, protein structure determination, and the inference of gene\nregulatory networks. For the inverse Ising problem in equilibrium, a number of\ncontrolled and uncontrolled approximate solutions have been developed in the\nstatistical mechanics community. A particularly strong method,\npseudolikelihood, stems from statistics. We also review the inverse Ising\nproblem in the non-equilibrium case, where the model parameters must be\nreconstructed based on non-equilibrium statistics.\n","authors":"H. Chau Nguyen|Riccardo Zecchina|Johannes Berg","affiliations":"","link_abstract":"http://arxiv.org/abs/1702.01522v4","link_pdf":"http://arxiv.org/pdf/1702.01522v4","link_doi":"http://dx.doi.org/10.1080/00018732.2017.1341604","comment":"Review article, 45 pages","journal_ref":"Advances in Physics, 66 (3), 197-261 (2017)","doi":"10.1080/00018732.2017.1341604","primary_category":"cond-mat.dis-nn","categories":"cond-mat.dis-nn|q-bio.GN|q-bio.MN|q-bio.NC"} {"id":"1702.01780v1","submitted":"2017-02-06 20:10:10","updated":"2017-02-06 20:10:10","title":"Toward the automated analysis of complex diseases in genome-wide\n association studies using genetic programming","abstract":" Machine learning has been gaining traction in recent years to meet the demand\nfor tools that can efficiently analyze and make sense of the ever-growing\ndatabases of biomedical data in health care systems around the world. However,\neffectively using machine learning methods requires considerable domain\nexpertise, which can be a barrier of entry for bioinformaticians new to\ncomputational data science methods. Therefore, off-the-shelf tools that make\nmachine learning more accessible can prove invaluable for bioinformaticians. To\nthis end, we have developed an open source pipeline optimization tool\n(TPOT-MDR) that uses genetic programming to automatically design machine\nlearning pipelines for bioinformatics studies. 
In TPOT-MDR, we implement\nMultifactor Dimensionality Reduction (MDR) as a feature construction method for\nmodeling higher-order feature interactions, and combine it with a new expert\nknowledge-guided feature selector for large biomedical data sets. We\ndemonstrate TPOT-MDR's capabilities using a combination of simulated and real\nworld data sets from human genetics and find that TPOT-MDR significantly\noutperforms modern machine learning methods such as logistic regression and\neXtreme Gradient Boosting (XGBoost). We further analyze the best pipeline\ndiscovered by TPOT-MDR for a real world problem and highlight TPOT-MDR's\nability to produce a high-accuracy solution that is also easily interpretable.\n","authors":"Andrew Sohn|Randal S. Olson|Jason H. Moore","affiliations":"","link_abstract":"http://arxiv.org/abs/1702.01780v1","link_pdf":"http://arxiv.org/pdf/1702.01780v1","link_doi":"","comment":"9 pages, 4 figures, submitted to GECCO 2017 conference and currently\n under review","journal_ref":"","doi":"","primary_category":"cs.NE","categories":"cs.NE|cs.LG|q-bio.QM|stat.ML"} {"id":"1702.02680v1","submitted":"2017-02-09 02:19:24","updated":"2017-02-09 02:19:24","title":"Manifold Based Low-rank Regularization for Image Restoration and\n Semi-supervised Learning","abstract":" Low-rank structures play an important role in recent advances on many problems\nin image science and data science. As a natural extension of low-rank\nstructures for data with nonlinear structures, the concept of the\nlow-dimensional manifold structure has been considered in many data processing\nproblems. Inspired by this concept, we consider a manifold based low-rank\nregularization as a linear approximation of manifold dimension. This\nregularization is less restricted than the global low-rank regularization, and\nthus enjoys more flexibility to handle data with nonlinear structures. As\napplications, we apply the proposed regularization to classical inverse\nproblems in image sciences and data sciences including image inpainting, image\nsuper-resolution, X-ray computer tomography (CT) image reconstruction and\nsemi-supervised learning. We conduct intensive numerical experiments in several\nimage restoration problems and a semi-supervised learning problem of\nclassifying handwritten digits using the MNIST data. Our numerical tests\ndemonstrate the effectiveness of the proposed methods and illustrate that the\nnew regularization methods produce outstanding results compared with many\nexisting methods.\n","authors":"Rongjie Lai|Jia Li","affiliations":"","link_abstract":"http://arxiv.org/abs/1702.02680v1","link_pdf":"http://arxiv.org/pdf/1702.02680v1","link_doi":"","comment":"23 pages, 13 figures","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|math.NA|65D18, 65J22, 68U10, 68Q32"} {"id":"1702.02799v1","submitted":"2017-02-09 12:06:37","updated":"2017-02-09 12:06:37","title":"UStore: A Distributed Storage With Rich Semantics","abstract":" Today's storage systems expose abstractions which are either too low-level\n(e.g., key-value store, raw-block store) that they require developers to\nre-invent the wheel, or too high-level (e.g., relational databases, Git) that\nthey lack generality to support many classes of applications. In this work, we\npropose and implement a general distributed data storage system, called UStore,\nwhich has rich semantics. 
UStore delivers three key properties, namely\nimmutability, sharing and security, which unify and add values to many classes\nof today's applications, and which also open the door for new applications. By\nkeeping the core properties within the storage, UStore helps reduce application\ndevelopment efforts while offering high performance at hand. The storage\nembraces current hardware trends as key enablers. It is built around a\ndata-structure similar to that of Git, a popular source code versioning system,\nbut it also synthesizes many designs from distributed systems and databases.\nOur current implementation of UStore has better performance than general\nin-memory key-value storage systems, especially for version scan operations. We\nport and evaluate four applications on top of UStore: a Git-like application, a\ncollaborative data science application, a transaction management application,\nand a blockchain application. We demonstrate that UStore enables faster\ndevelopment and the UStore-backed applications can have better performance than\nthe existing implementations.\n","authors":"Anh Dinh|Ji Wang|Sheng Wang|Gang Chen|Wei-Ngan Chin|Qian Lin|Beng Chin Ooi|Pingcheng Ruan|Kian-Lee Tan|Zhongle Xie|Hao Zhang|Meihui Zhang","affiliations":"","link_abstract":"http://arxiv.org/abs/1702.02799v1","link_pdf":"http://arxiv.org/pdf/1702.02799v1","link_doi":"","comment":"21 pages","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB|cs.DC"} {"id":"1702.03825v1","submitted":"2017-02-10 07:47:48","updated":"2017-02-10 07:47:48","title":"Analyzing and Visualizing Scalar Fields on Graphs","abstract":" The value proposition of a dataset often resides in the implicit\ninterconnections or explicit relationships (patterns) among individual\nentities, and is often modeled as a graph. Effective visualization of such\ngraphs can lead to key insights uncovering such value. In this article we\npropose a visualization method to explore graphs with numerical attributes\nassociated with nodes (or edges) -- referred to as scalar graphs. Such\nnumerical attributes can represent raw content information, similarities, or\nderived information reflecting important network measures such as triangle\ndensity and centrality. The proposed visualization strategy seeks to\nsimultaneously uncover the relationship between attribute values and graph\ntopology, and relies on transforming the network to generate a terrain map. A\nkey objective here is to ensure that the terrain map reveals the overall\ndistribution of components-of-interest (e.g. dense subgraphs, k-cores) and the\nrelationships among them while being sensitive to the attribute values over the\ngraph. We also design extensions that can capture the relationship across\nmultiple numerical attributes (scalars). 
We demonstrate the efficacy of our\nmethod on several real-world data science tasks while scaling to large graphs\nwith millions of nodes.\n","authors":"Yang Zhang|Yusu Wang|Srinivasan Parthasarathy","affiliations":"","link_abstract":"http://arxiv.org/abs/1702.03825v1","link_pdf":"http://arxiv.org/pdf/1702.03825v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1702.03120v1","submitted":"2017-02-10 10:21:28","updated":"2017-02-10 10:21:28","title":"Deterministic entanglement generation from driving through quantum phase\n transitions","abstract":" Many-body entanglement is often created through system evolution, aided by\nnon-linear interactions between the constituting particles. The very dynamics,\nhowever, can also lead to fluctuations and degradation of the entanglement if\nthe interactions cannot be controlled. Here, we demonstrate near-deterministic\ngeneration of an entangled twin-Fock condensate of $\\sim11000$ atoms by driving\na $^{87}$Rb Bose-Einstein condensate undergoing spin mixing through two\nconsecutive quantum phase transitions (QPTs). We directly observe number\nsqueezing of $10.7\\pm0.6$ dB and normalized collective spin length of\n$0.99\\pm0.01$. Together, these observations allow us to infer an\nentanglement-enhanced phase sensitivity of $\\sim6$ dB beyond the standard\nquantum limit and an entanglement breadth of $\\sim910$ atoms. Our work\nhighlights the power of generating large-scale useful entanglement by taking\nadvantage of the different entanglement landscapes separated by QPTs.\n","authors":"Xin-Yu Luo|Yi-Quan Zou|Ling-Na Wu|Qi Liu|Ming-Fei Han|Meng Khoon Tey|Li You","affiliations":"","link_abstract":"http://arxiv.org/abs/1702.03120v1","link_pdf":"http://arxiv.org/pdf/1702.03120v1","link_doi":"http://dx.doi.org/10.1126/science.aag1106","comment":"Supplementary materials can be found at\n http://science.sciencemag.org/content/355/6325/620/tab-figures-data","journal_ref":"Science 355, 620-623 (2017)","doi":"10.1126/science.aag1106","primary_category":"cond-mat.quant-gas","categories":"cond-mat.quant-gas"} {"id":"1702.04371v1","submitted":"2017-02-14 20:03:51","updated":"2017-02-14 20:03:51","title":"Reconstruction of Galaxy Star Formation Histories through SED Fitting:\n The Dense Basis Approach","abstract":" We introduce the Dense Basis method for Spectral Energy Distribution (SED)\nfitting. It accurately recovers traditional SED parameters, including M$_*$,\nSFR and dust attenuation, and reveals previously inaccessible information about\nthe number and duration of star formation episodes and the timing of stellar\nmass assembly, as well as uncertainties in these quantities. This is done using\nbasis Star Formation Histories (SFHs) chosen by comparing the goodness-of-fit\nof mock galaxy SEDs to the goodness-of-reconstruction of their SFHs. We train\nand validate the method using a sample of realistic SFHs at $z =1$ drawn from\nstochastic realisations, semi-analytic models, and a cosmological\nhydrodynamical galaxy formation simulation. The method is then applied to a\nsample of 1100 CANDELS GOODS-S galaxies at $110^{9.5}M_\\odot$. About $40\\%$ of the CANDELS galaxies\nhave SFHs whose maximum occurs at or near the epoch of observation. The Dense\nBasis method is scalable and offers a general approach to a broad class of\ndata-science problems.\n","authors":"Kartheik G. 
Iyer|Eric Gawiser","affiliations":"","link_abstract":"http://arxiv.org/abs/1702.04371v1","link_pdf":"http://arxiv.org/pdf/1702.04371v1","link_doi":"http://dx.doi.org/10.3847/1538-4357/aa63f0","comment":"27 pages including 20 color figures, revised in response to referee's\n report","journal_ref":"","doi":"10.3847/1538-4357/aa63f0","primary_category":"astro-ph.GA","categories":"astro-ph.GA"} {"id":"1702.06151v1","submitted":"2017-02-20 19:22:21","updated":"2017-02-20 19:22:21","title":"Developing a comprehensive framework for multimodal feature extraction","abstract":" Feature extraction is a critical component of many applied data science\nworkflows. In recent years, rapid advances in artificial intelligence and\nmachine learning have led to an explosion of feature extraction tools and\nservices that allow data scientists to cheaply and effectively annotate their\ndata along a vast array of dimensions---ranging from detecting faces in images\nto analyzing the sentiment expressed in coherent text. Unfortunately, the\nproliferation of powerful feature extraction services has been mirrored by a\ncorresponding expansion in the number of distinct interfaces to feature\nextraction services. In a world where nearly every new service has its own API,\ndocumentation, and/or client library, data scientists who need to combine\ndiverse features obtained from multiple sources are often forced to write and\nmaintain ever more elaborate feature extraction pipelines. To address this\nchallenge, we introduce a new open-source framework for comprehensive\nmultimodal feature extraction. Pliers is an open-source Python package that\nsupports standardized annotation of diverse data types (video, images, audio,\nand text), and is expressly designed with both ease-of-use and extensibility in\nmind. Users can apply a wide range of pre-existing feature extraction tools to\ntheir data in just a few lines of Python code, and can also easily add their own\ncustom extractors by writing modular classes. A graph-based API enables rapid\ndevelopment of complex feature extraction pipelines that output results in a\nsingle, standardized format. We describe the package's architecture, detail its\nmajor advantages over previous feature extraction toolboxes, and use a sample\napplication to a large functional MRI dataset to illustrate how pliers can\nsignificantly reduce the time and effort required to construct sophisticated\nfeature extraction workflows while increasing code clarity and maintainability.\n","authors":"Quinten McNamara|Alejandro de la Vega|Tal Yarkoni","affiliations":"","link_abstract":"http://arxiv.org/abs/1702.06151v1","link_pdf":"http://arxiv.org/pdf/1702.06151v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|cs.IR|cs.LG|cs.MM"} {"id":"1703.00060v2","submitted":"2017-02-28 21:20:19","updated":"2018-03-28 19:59:00","title":"Achieving non-discrimination in prediction","abstract":" Discrimination-aware classification is receiving increasing attention in\ndata science fields. The pre-process methods for constructing a\ndiscrimination-free classifier first remove discrimination from the training\ndata, and then learn the classifier from the cleaned data. However, they lack a\ntheoretical guarantee for the potential discrimination when the classifier is\ndeployed for prediction. In this paper, we fill this gap by mathematically\nbounding the probability of the discrimination in prediction being within a\ngiven interval in terms of the training data and classifier. 
We adopt the\ncausal model for modeling the data generation mechanism, and formally defining\ndiscrimination in population, in a dataset, and in prediction. We obtain two\nimportant theoretical results: (1) the discrimination in prediction can still\nexist even if the discrimination in the training data is completely removed;\nand (2) not all pre-process methods can ensure non-discrimination in prediction\neven though they can achieve non-discrimination in the modified training data.\nBased on the results, we develop a two-phase framework for constructing a\ndiscrimination-free classifier with a theoretical guarantee. The experiments\ndemonstrate the theoretical results and show the effectiveness of our two-phase\nframework.\n","authors":"Lu Zhang|Yongkai Wu|Xintao Wu","affiliations":"University of Arkansas|University of Arkansas|University of Arkansas","link_abstract":"http://arxiv.org/abs/1703.00060v2","link_pdf":"http://arxiv.org/pdf/1703.00060v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1703.01601v2","submitted":"2017-03-05 15:00:10","updated":"2018-04-21 20:34:28","title":"Doing Things Twice (Or Differently): Strategies to Identify Studies for\n Targeted Validation","abstract":" The \"reproducibility crisis\" has been a highly visible source of scientific\ncontroversy and dispute. Here, I propose and review several avenues for\nidentifying and prioritizing research studies for the purpose of targeted\nvalidation. Of the various proposals discussed, I identify scientific data\nscience as being a strategy that merits greater attention among those\ninterested in reproducibility. I argue that the tremendous potential of\nscientific data science for uncovering high-value research studies is a\nsignificant and rarely discussed benefit of the transition to a fully\nopen-access publishing model.\n","authors":"Gopal P. Sarma","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.01601v2","link_pdf":"http://arxiv.org/pdf/1703.01601v2","link_doi":"","comment":"4 pages","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.DL|physics.soc-ph|q-bio.OT"} {"id":"1703.02475v1","submitted":"2017-03-07 17:09:13","updated":"2017-03-07 17:09:13","title":"OrpheusDB: Bolt-on Versioning for Relational Databases","abstract":" Data science teams often collaboratively analyze datasets, generating dataset\nversions at each stage of iterative exploration and analysis. There is a\npressing need for a system that can support dataset versioning, enabling such\nteams to efficiently store, track, and query across dataset versions. We\nintroduce OrpheusDB, a dataset version control system that \"bolts on\"\nversioning capabilities to a traditional relational database system, thereby\ngaining the analytics capabilities of the database \"for free\". We develop and\nevaluate multiple data models for representing versioned data, as well as a\nlight-weight partitioning scheme, LyreSplit, to further optimize the models for\nreduced query latencies. With LyreSplit, OrpheusDB is on average 1000x faster\nin finding effective (and better) partitionings than competing approaches,\nwhile also reducing the latency of version retrieval by up to 20x relative to\nschemes without partitioning. 
LyreSplit can be applied in an online fashion as\nnew versions are added, alongside an intelligent migration scheme that reduces\nmigration time by 10x on average.\n","authors":"Silu Huang|Liqi Xu|Jialin Liu|Aaron Elmore|Aditya Parameswaran","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.02475v1","link_pdf":"http://arxiv.org/pdf/1703.02475v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1703.02930v3","submitted":"2017-03-08 17:35:17","updated":"2017-10-16 01:29:59","title":"Nearly-tight VC-dimension and pseudodimension bounds for piecewise\n linear neural networks","abstract":" We prove new upper and lower bounds on the VC-dimension of deep neural\nnetworks with the ReLU activation function. These bounds are tight for almost\nthe entire range of parameters. Letting $W$ be the number of weights and $L$ be\nthe number of layers, we prove that the VC-dimension is $O(W L \\log(W))$, and\nprovide examples with VC-dimension $\\Omega( W L \\log(W/L) )$. This improves\nboth the previously known upper bounds and lower bounds. In terms of the number\n$U$ of non-linear units, we prove a tight bound $\\Theta(W U)$ on the\nVC-dimension. All of these bounds generalize to arbitrary piecewise linear\nactivation functions, and also hold for the pseudodimensions of these function\nclasses.\n Combined with previous results, this gives an intriguing range of\ndependencies of the VC-dimension on depth for networks with different\nnon-linearities: there is no dependence for piecewise-constant, linear\ndependence for piecewise-linear, and no more than quadratic dependence for\ngeneral piecewise-polynomial.\n","authors":"Peter L. Bartlett|Nick Harvey|Chris Liaw|Abbas Mehrabian","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.02930v3","link_pdf":"http://arxiv.org/pdf/1703.02930v3","link_doi":"","comment":"Extended abstract appeared in COLT 2017; the upper bound was\n presented at the 2016 ACM Conference on Data Science. This version includes\n all the proofs and a refinement of the upper bound, Theorem 6. 16 pages, 2\n figures","journal_ref":"Journal of Machine Learning Research 20 (2019) 1-17","doi":"","primary_category":"cs.LG","categories":"cs.LG"} {"id":"1703.03076v2","submitted":"2017-03-08 23:54:09","updated":"2018-04-14 14:08:44","title":"Causal Data Science for Financial Stress Testing","abstract":" The most recent financial upheavals have cast doubt on the adequacy of some\nof the conventional quantitative risk management strategies, such as VaR (Value\nat Risk), in many common situations. Consequently, there has been an increasing\nneed for verisimilar financial stress testings, namely simulating and analyzing\nfinancial portfolios in extreme, albeit rare scenarios. Unlike conventional\nrisk management which exploits statistical correlations among financial\ninstruments, here we focus our analysis on the notion of probabilistic\ncausation, which is embodied by Suppes-Bayes Causal Networks (SBCNs); SBCNs are\nprobabilistic graphical models that have many attractive features in terms of\nmore accurate causal analysis for generating financial stress scenarios. In\nthis paper, we present a novel approach for conducting stress testing of\nfinancial portfolios based on SBCNs in combination with classical machine\nlearning classification tools. 
The resulting method is shown to be capable of\ncorrectly discovering the causal relationships among financial factors that\naffect the portfolios and thus, simulating stress testing scenarios with a\nhigher accuracy and lower computational complexity than conventional Monte\nCarlo Simulations.\n","authors":"Gelin Gao|Bud Mishra|Daniele Ramazzotti","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.03076v2","link_pdf":"http://arxiv.org/pdf/1703.03076v2","link_doi":"http://dx.doi.org/10.1016/j.jocs.2018.04.003","comment":"","journal_ref":"","doi":"10.1016/j.jocs.2018.04.003","primary_category":"cs.LG","categories":"cs.LG|cs.AI|cs.CE"} {"id":"1703.03869v1","submitted":"2017-03-10 23:26:33","updated":"2017-03-10 23:26:33","title":"Deep Learning in Customer Churn Prediction: Unsupervised Feature\n Learning on Abstract Company Independent Feature Vectors","abstract":" As companies increase their efforts in retaining customers, being able to\npredict accurately ahead of time whether a customer will churn in the\nforeseeable future is an extremely powerful tool for any marketing team. The\npaper describes in depth the application of Deep Learning in the problem of\nchurn prediction. Using abstract feature vectors that can be generated from any\nsubscription-based company's user event logs, the paper proves that through the\nuse of the intrinsic property of Deep Neural Networks (learning secondary\nfeatures in an unsupervised manner), the complete pipeline can be applied to\nany subscription-based company with extremely good churn predictive\nperformance. Furthermore, the research documented in the paper was performed\nfor Framed Data (a company that sells churn prediction as a service for other\ncompanies) in conjunction with the Data Science Institute at Lancaster\nUniversity, UK. This paper is the intellectual property of Framed Data.\n","authors":"Philip Spanoudes|Thomson Nguyen","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.03869v1","link_pdf":"http://arxiv.org/pdf/1703.03869v1","link_doi":"","comment":"23 pages, 14 figures","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1703.04058v2","submitted":"2017-03-12 02:16:11","updated":"2017-08-03 02:45:42","title":"Think globally, fit locally under the Manifold Setup: Asymptotic\n Analysis of Locally Linear Embedding","abstract":" Since its introduction in 2000, the locally linear embedding (LLE) has been\nwidely applied in data science. We provide an asymptotic analysis of the LLE\nunder the manifold setup. We show that for the general manifold, asymptotically\nwe may not obtain the Laplace-Beltrami operator, and the result may depend on\nthe non-uniform sampling, unless a correct regularization is chosen. We also\nderive the corresponding kernel function, which indicates that the LLE is not a\nMarkov process. A comparison with the other commonly applied nonlinear\nalgorithms, particularly the diffusion map, is provided, and its relationship\nwith the locally linear regression is also discussed.\n","authors":"Hau-Tieng Wu|Nan Wu","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.04058v2","link_pdf":"http://arxiv.org/pdf/1703.04058v2","link_doi":"","comment":"78 pages, 4 figures. We add a short discussion about the relation\n between epsilon and the intrinsic geometry of the manifold. We add a new\n section about K nearest neighborhood (KNN) and a new subsection about error\n in variable. 
We provide more numerical examples","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.TH|62-07"} {"id":"1703.04900v2","submitted":"2017-03-15 02:57:57","updated":"2017-06-19 23:12:25","title":"Portable learning environments for hands-on computational instruction:\n Using container- and cloud-based technology to teach data science","abstract":" There is an increasing interest in learning outside of the traditional\nclassroom setting. This is especially true for topics covering computational\ntools and data science, as both are challenging to incorporate in the standard\ncurriculum. These atypical learning environments offer new opportunities for\nteaching, particularly when it comes to combining conceptual knowledge with\nhands-on experience/expertise with methods and skills. Advances in cloud\ncomputing and containerized environments provide an attractive opportunity to\nimprove the efficiency and ease with which students can learn. This manuscript\ndetails recent advances towards using commonly-available cloud computing\nservices and advanced cyberinfrastructure support for improving the learning\nexperience in bootcamp-style events. We cover the benefits (and challenges) of\nusing a server hosted remotely instead of relying on student laptops, discuss\nthe technology that was used in order to make this possible, and give\nsuggestions for how others could implement and improve upon this model for\npedagogy and reproducibility.\n","authors":"Chris Holdgraf|Aaron Culich|Ariel Rokem|Fatma Deniz|Maryana Alegro|Dani Ushizima","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.04900v2","link_pdf":"http://arxiv.org/pdf/1703.04900v2","link_doi":"","comment":"Accepted at the PEARC 2017 conference in New Orleans, LA","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.HC"} {"id":"1703.06118v2","submitted":"2017-03-15 12:01:52","updated":"2017-11-03 14:35:30","title":"A Review on Flight Delay Prediction","abstract":" Flight delays hurt airlines, airports, and passengers. Their prediction is\ncrucial during the decision-making process for all players of commercial\naviation. Moreover, the development of accurate prediction models for flight\ndelays became cumbersome due to the complexity of air transportation system,\nthe number of methods for prediction, and the deluge of flight data. In this\ncontext, this paper presents a thorough literature review of approaches used to\nbuild flight delay prediction models from the Data Science perspective. 
We\npropose a taxonomy and summarize the initiatives used to address the flight\ndelay prediction problem, according to scope, data, and computational methods,\ngiving particular attention to an increased usage of machine learning methods.\nBesides, we also present a timeline of significant works that depicts\nrelationships between flight delay prediction problems and research trends to\naddress them.\n","authors":"Alice Sternberg|Jorge Soares|Diego Carvalho|Eduardo Ogasawara","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.06118v2","link_pdf":"http://arxiv.org/pdf/1703.06118v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.CE"} {"id":"1703.05830v5","submitted":"2017-03-16 21:35:15","updated":"2017-11-15 19:29:24","title":"Automatically identifying, counting, and describing wild animals in\n camera-trap images with deep learning","abstract":" Having accurate, detailed, and up-to-date information about the location and\nbehavior of animals in the wild would revolutionize our ability to study and\nconserve ecosystems. We investigate the ability to automatically, accurately,\nand inexpensively collect such data, which could transform many fields of\nbiology, ecology, and zoology into \"big data\" sciences. Motion sensor \"camera\ntraps\" enable collecting wildlife pictures inexpensively, unobtrusively, and\nfrequently. However, extracting information from these pictures remains an\nexpensive, time-consuming, manual task. We demonstrate that such information\ncan be automatically extracted by deep learning, a cutting-edge type of\nartificial intelligence. We train deep convolutional neural networks to\nidentify, count, and describe the behaviors of 48 species in the\n3.2-million-image Snapshot Serengeti dataset. Our deep neural networks\nautomatically identify animals with over 93.8% accuracy, and we expect that\nnumber to improve rapidly in years to come. More importantly, if our system\nclassifies only images it is confident about, our system can automate animal\nidentification for 99.3% of the data while still performing at the same 96.6%\naccuracy as that of crowdsourced teams of human volunteers, saving more than\n8.4 years (at 40 hours per week) of human labeling effort (i.e. over 17,000\nhours) on this 3.2-million-image dataset. Those efficiency gains immediately\nhighlight the importance of using deep neural networks to automate data\nextraction from camera-trap images. Our results suggest that this technology\ncould enable the inexpensive, unobtrusive, high-volume, and even real-time\ncollection of a wealth of information about vast numbers of animals in the\nwild.\n","authors":"Mohammed Sadegh Norouzzadeh|Anh Nguyen|Margaret Kosmala|Ali Swanson|Meredith Palmer|Craig Packer|Jeff Clune","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.05830v5","link_pdf":"http://arxiv.org/pdf/1703.05830v5","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|cs.LG"} {"id":"1703.06450v1","submitted":"2017-03-19 14:51:17","updated":"2017-03-19 14:51:17","title":"Building a Disciplinary, World-Wide Data Infrastructure","abstract":" Sharing scientific data, with the objective of making it fully discoverable,\naccessible, assessable, intelligible, usable, and interoperable, requires work\nat the disciplinary level to define in particular how the data should be\nformatted and described. 
Each discipline has its own organization and history\nas a starting point, and this paper explores the way a range of disciplines,\nnamely materials science, crystallography, astronomy, earth sciences,\nhumanities and linguistics get organized at the international level to tackle\nthis question. In each case, the disciplinary culture with respect to data\nsharing, science drivers, organization and lessons learnt are briefly\ndescribed, as well as the elements of the specific data infrastructure which\nare or could be shared with others. Commonalities and differences are assessed.\nCommon key elements for success are identified: data sharing should be science\ndriven; defining the disciplinary part of the interdisciplinary standards is\nmandatory but challenging; sharing of applications should accompany data\nsharing. Incentives such as journal and funding agency requirements are also\nsimilar. For all, it also appears that social aspects are more challenging than\ntechnological ones. Governance is more diverse, and linked to the discipline\norganization. CODATA, the RDA and the WDS can facilitate the establishment of\ndisciplinary interoperability frameworks. Being problem-driven is also a key\nfactor of success for building bridges to enable interdisciplinary research.\n","authors":"Françoise Genova|Christophe Arviset|Bridget M. Almas|Laura Bartolo|Daan Broeder|Emily Law|Brian McMahon","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.06450v1","link_pdf":"http://arxiv.org/pdf/1703.06450v1","link_doi":"http://dx.doi.org/10.5334/dsj-2017-016","comment":"Proceedings of the session \"Building a disciplinary, world-wide data\n infrastructure\" of SciDataCon 2016, held in Denver, CO, USA, 12-14 September\n 2016, to be published in ICSU CODATA Data Science Journal in 2017","journal_ref":"Data Science Journal, 16 p.16, 2017","doi":"10.5334/dsj-2017-016","primary_category":"astro-ph.IM","categories":"astro-ph.IM"} {"id":"1703.08593v2","submitted":"2017-03-24 20:52:32","updated":"2017-12-21 15:46:59","title":"Analyzing Evolving Stories in News Articles","abstract":" There is an overwhelming number of news articles published every day around\nthe globe. Following the evolution of a news-story is a difficult task given\nthat there is no such mechanism available to track back in time to study the\ndiffusion of the relevant events in digital news feeds. The techniques\ndeveloped so far to extract meaningful information from a massive corpus rely\non similarity search, which results in a myopic loopback to the same topic\nwithout providing the needed insights to hypothesize the origin of a story that\nmay be completely different than the news today. In this paper, we present an\nalgorithm that mines historical data to detect the origin of an event, segments\nthe timeline into disjoint groups of coherent news articles, and outlines the\nmost important documents in a timeline with a soft probability to provide a\nbetter understanding of the evolution of a story. Qualitative and quantitative\napproaches to evaluate our framework demonstrate that our algorithm discovers\nstatistically significant and meaningful stories in reasonable time.\nAdditionally, a relevant case study on a set of news articles demonstrates that\nthe generated output of the algorithm holds the promise to aid prediction of\nfuture entities in a story.\n","authors":"Roberto Camacho Barranco|Arnold P. Boedihardjo|M. 
Shahriar Hossain","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.08593v2","link_pdf":"http://arxiv.org/pdf/1703.08593v2","link_doi":"http://dx.doi.org/10.1007/s41060-017-0091-9","comment":"This is a pre-print of an article published in the International\n Journal of Data Science and Analytics. The final authenticated version is\n available online at: https://doi.org/10.1007/s41060-017-0091-9","journal_ref":"","doi":"10.1007/s41060-017-0091-9","primary_category":"cs.IR","categories":"cs.IR|cs.IT|math.IT"} {"id":"1703.08694v1","submitted":"2017-03-25 13:39:51","updated":"2017-03-25 13:39:51","title":"Toward Semantic Foundations for Program Editors","abstract":" Programming language definitions assign formal meaning to complete programs.\nProgrammers, however, spend a substantial amount of time interacting with\nincomplete programs -- programs with holes, type inconsistencies and binding\ninconsistencies -- using tools like program editors and live programming\nenvironments (which interleave editing and evaluation). Semanticists have done\ncomparatively little to formally characterize (1) the static and dynamic\nsemantics of incomplete programs; (2) the actions available to programmers as\nthey edit and inspect incomplete programs; and (3) the behavior of editor\nservices that suggest likely edit actions to the programmer based on semantic\ninformation extracted from the incomplete program being edited, and from\nprograms that the system has encountered in the past. As such, each tool\ndesigner has largely been left to develop their own ad hoc heuristics.\n This paper serves as a vision statement for a research program that seeks to\ndevelop these \"missing\" semantic foundations. Our hope is that these\ncontributions, which will take the form of a series of simple formal calculi\nequipped with a tractable metatheory, will guide the design of a variety of\ncurrent and future interactive programming tools, much as various lambda\ncalculi have guided modern language designs. Our own research will apply these\nprinciples in the design of Hazel, an experimental live lab notebook\nprogramming environment designed for data science tasks. We plan to co-design\nthe Hazel language with the editor so that we can explore concepts such as\nedit-time semantic conflict resolution mechanisms and mechanisms that allow\nlibrary providers to install library-specific editor services.\n","authors":"Cyrus Omar|Ian Voysey|Michael Hilton|Joshua Sunshine|Claire Le Goues|Jonathan Aldrich|Matthew A. Hammer","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.08694v1","link_pdf":"http://arxiv.org/pdf/1703.08694v1","link_doi":"","comment":"The 2nd Summit on Advances in Programming Languages (SNAPL 2017)","journal_ref":"","doi":"","primary_category":"cs.PL","categories":"cs.PL"} {"id":"1703.10146v1","submitted":"2017-03-29 17:21:44","updated":"2017-03-29 17:21:44","title":"Community detection and stochastic block models: recent developments","abstract":" The stochastic block model (SBM) is a random graph model with planted\nclusters. 
It is widely employed as a canonical model to study clustering and\ncommunity detection, and provides generally a fertile ground to study the\nstatistical and computational tradeoffs that arise in network and data\nsciences.\n This note surveys the recent developments that establish the fundamental\nlimits for community detection in the SBM, both with respect to\ninformation-theoretic and computational thresholds, and for various recovery\nrequirements such as exact, partial and weak recovery (a.k.a., detection). The\nmain results discussed are the phase transitions for exact recovery at the\nChernoff-Hellinger threshold, the phase transition for weak recovery at the\nKesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial\nrecovery, the learning of the SBM parameters and the gap between\ninformation-theoretic and computational thresholds.\n The note also covers some of the algorithms developed in the quest of\nachieving the limits, in particular two-round algorithms via graph-splitting,\nsemi-definite programming, linearized belief propagation, classical and\nnonbacktracking spectral methods. A few open problems are also discussed.\n","authors":"Emmanuel Abbe","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.10146v1","link_pdf":"http://arxiv.org/pdf/1703.10146v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.PR","categories":"math.PR|cs.CC|cs.IT|cs.SI|math.IT|stat.ML"} {"id":"1703.10459v3","submitted":"2017-03-30 13:20:37","updated":"2019-04-14 17:34:16","title":"Random Discretization of the Finite Fourier Transform and Related Kernel\n Random Matrices","abstract":" This paper is centred on the spectral study of a Random Fourier matrix, that\nis an $n\\times n$ matrix $A$ whose $(j, k)$ entries are $\\exp(2i\\pi m X_jY_k)$,\nwith $X_j$ and $Y_k$ two i.i.d sequences of random variables and $1\\leq m\\leq\nn$ is a real number. When they are uniformly distributed on a symmetric\ninterval, this may be seen as a random discretization of the Finite Fourier\ntransform, whose spectrum has been extensively studied in relation with\nband-limited functions. Our study is two-fold. Firstly, by pushing forward\nconcentration inequalities, we find an accurate comparison in $\\ell^2$- norm\nbetween the spectrum of $A^*A$ and the one of an integral operator that can be\ndefined in terms of the two probability laws chosen for the rows and the\ncolumns. Our study includes the one of stationary Hermitian kernel matrices and\ncan be generalized to non stationary ones, for which the same kind of\ncomparison with an integral operator is possible. Because of possible\napplications in the data science area, these last matrices have been largely\nstudied in the literature and our results are compared with previous ones.\n Secondly we concentrate on uniform distributions for the laws of $X_j$'s and\n$Y_k$'s, for which the integral operator is the well-known Sinc-kernel operator\nwith parameter $m.$ Our previous study allows to translate to random Fourier\nmatrices the knowledge that we have on the spectrum of this operator. We have\nfor them asymptotic results for $m, n$ and $n/m$ tending to $\\infty$, as well\nas non asymptotic bounds in the spirit of recent work on the integral\noperators. As an application, we give fairly good approximations of the number\nof degrees of freedom and the capacity of a MIMO wireless communication network\napproximation model. 
Finally, we provide the reader with some numerical\nexamples that illustrate the theoretical results of this paper.\n","authors":"Aline Bonami|Abderrazek Karoui","affiliations":"MAPMO|","link_abstract":"http://arxiv.org/abs/1703.10459v3","link_pdf":"http://arxiv.org/pdf/1703.10459v3","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.CA","categories":"math.CA"} {"id":"1703.10832v2","submitted":"2017-03-31 10:14:27","updated":"2017-06-01 02:45:34","title":"Social dynamics of financial networks","abstract":" The global financial crisis in 2007-2009 demonstrated that systemic risk can\nspread all over the world through a complex web of financial linkages, yet we\nstill lack fundamental knowledge about the evolution of the financial web. In\nparticular, interbank credit networks shape the core of the financial system,\nin which a time-varying interconnected risk emerges from a massive number of\ntemporal transactions between banks. The current lack of understanding of the\nmechanics of interbank networks makes it difficult to evaluate and control\nsystemic risk. Here, we uncover fundamental dynamics of interbank networks by\nseeking the patterns of daily transactions between individual banks. We find\nstable interaction patterns between banks from which distinctive network-scale\ndynamics emerge. In fact, the dynamical patterns discovered at the local and\nnetwork scales share common characteristics with social communication patterns\nof humans. To explain the origin of \"social\" dynamics in interbank networks, we\nprovide a simple model that allows us to generate a sequence of synthetic daily\nnetworks characterized by the observed dynamical properties. The discovery of\ndynamical principles at the daily resolution will enhance our ability to assess\nsystemic risk and could contribute to the real-time management of financial\nstability.\n","authors":"Teruyoshi Kobayashi|Taro Takaguchi","affiliations":"","link_abstract":"http://arxiv.org/abs/1703.10832v2","link_pdf":"http://arxiv.org/pdf/1703.10832v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-018-0143-y","comment":"7 pages, 5 figures, SI included","journal_ref":"EPJ Data Science, 2018 7:15","doi":"10.1140/epjds/s13688-018-0143-y","primary_category":"q-fin.ST","categories":"q-fin.ST|physics.soc-ph"} {"id":"1704.03421v2","submitted":"2017-04-11 17:05:01","updated":"2018-02-26 15:23:31","title":"Efficient Large Scale Clustering based on Data Partitioning","abstract":" Clustering techniques are very attractive for extracting and identifying\npatterns in datasets. However, their application to very large spatial datasets\npresents numerous challenges such as high-dimensionality data, heterogeneity,\nand high complexity of some algorithms. For instance, some algorithms may have\nlinear complexity but they require the domain knowledge in order to determine\ntheir input parameters. Distributed clustering techniques constitute a very\ngood alternative to the big data challenges (e.g.,Volume, Variety, Veracity,\nand Velocity). Usually these techniques consist of two phases. The first phase\ngenerates local models or patterns and the second one tends to aggregate the\nlocal results to obtain global models. While the first phase can be executed in\nparallel on each site and, therefore, efficient, the aggregation phase is\ncomplex, time consuming and may produce incorrect and ambiguous global clusters\nand therefore incorrect models. 
In this paper we propose a new distributed\nclustering approach to deal efficiently with both phases, generation of local\nresults and generation of global models by aggregation. For the first phase,\nour approach is capable of analysing the datasets located in each site using\ndifferent clustering techniques. The aggregation phase is designed in such a\nway that the final clusters are compact and accurate while the overall process\nis efficient in time and memory allocation. For the evaluation, we use two\nwell-known clustering algorithms, K-Means and DBSCAN. One of the key outputs of\nthis distributed clustering technique is that the number of global clusters is\ndynamic, no need to be fixed in advance. Experimental results show that the\napproach is scalable and produces high quality results.\n","authors":"Malika Bendechache|Nhien-An Le-Khac|M-Tahar Kechadi","affiliations":"","link_abstract":"http://arxiv.org/abs/1704.03421v2","link_pdf":"http://arxiv.org/pdf/1704.03421v2","link_doi":"http://dx.doi.org/10.1109/DSAA.2016.70","comment":"10 pages","journal_ref":"Data Science and Advanced Analytics (DSAA), 2016 IEEE\n International Conference on, 612--621, 2016","doi":"10.1109/DSAA.2016.70","primary_category":"cs.DB","categories":"cs.DB|cs.LG"} {"id":"1704.03528v2","submitted":"2017-04-11 20:35:29","updated":"2018-05-20 14:59:38","title":"Computation of atomic astrophysical opacities","abstract":" The revision of the standard Los Alamos opacities in the 1980-1990s by a\ngroup from the Lawrence Livermore National Laboratory (OPAL) and the Opacity\nProject (OP) consortium was an early example of collaborative big-data science,\nleading to reliable data deliverables (atomic databases, monochromatic\nopacities, mean opacities, and radiative accelerations) widely used since then\nto solve a variety of important astrophysical problems. Nowadays the precision\nof the OPAL and OP opacities, and even of new tables (OPLIB) by Los Alamos, is\na recurrent topic in a hot debate involving stringent comparisons between\ntheory, laboratory experiments, and solar and stellar observations in\nsophisticated research fields: the standard solar model (SSM), helio and\nasteroseismology, non-LTE 3D hydrodynamic photospheric modeling, nuclear\nreaction rates, solar neutrino observations, computational atomic physics, and\nplasma experiments. In this context, an unexpected downward revision of the\nsolar photospheric metal abundances in 2005 spoiled a very precise agreement\nbetween the helioseismic indicators (the radius of the convection zone\nboundary, the sound-speed profile, and helium surface abundance) and SSM\nbenchmarks, which could be somehow reestablished with a substantial opacity\nincrease. Recent laboratory measurements of the iron opacity in physical\nconditions similar to the boundary of the solar convection zone have indeed\npredicted significant increases (30-400%), although new systematic improvements\nand comparisons of the computed tables have not yet been able to reproduce\nthem. We give an overview of this controversy, and within the OP approach,\ndiscuss some of the theoretical shortcomings that could be impairing a more\ncomplete and accurate opacity accounting\n","authors":"Claudio Mendoza","affiliations":"","link_abstract":"http://arxiv.org/abs/1704.03528v2","link_pdf":"http://arxiv.org/pdf/1704.03528v2","link_doi":"http://dx.doi.org/10.3390/atoms6020028","comment":"31 pages, 10 figures. 
This review is originally based on a talk given\n at the 12th International Colloquium on Atomic Spectra and Oscillator\n Strengths for Astrophysical and Laboratory Plasmas, Sao Paulo, Brazil, July\n 2016. It has been published in the Atoms online journal","journal_ref":"Atoms, 2018, 6(2), 28","doi":"10.3390/atoms6020028","primary_category":"astro-ph.SR","categories":"astro-ph.SR"} {"id":"1704.06977v4","submitted":"2017-04-23 20:43:44","updated":"2019-06-20 23:02:35","title":"Adaptive Estimation in Structured Factor Models with Applications to\n Overlapping Clustering","abstract":" This work introduces a novel estimation method, called LOVE, of the entries\nand structure of a loading matrix A in a sparse latent factor model X = AZ + E,\nfor an observable random vector X in R^p, with correlated unobservable factors Z\n\in R^K, with K unknown, and independent noise E. Each row of A is scaled and\nsparse. In order to identify the loading matrix A, we require the existence of\npure variables, which are components of X that are associated, via A, with one\nand only one latent factor. Despite the fact that the number of factors K, the\nnumber of the pure variables, and their location are all unknown, we only\nrequire a mild condition on the covariance matrix of Z, and a minimum of only\ntwo pure variables per latent factor to show that A is uniquely defined, up to\nsigned permutations. Our proofs for model identifiability are constructive, and\nlead to our novel estimation method of the number of factors and of the set of\npure variables, from a sample of size n of observations on X. This is the first\nstep of our LOVE algorithm, which is optimization-free, and has low\ncomputational complexity of order p^2. The second step of LOVE is an easily\nimplementable linear program that estimates A. We prove that the resulting\nestimator is minimax rate optimal up to logarithmic factors in p. The model\nstructure is motivated by the problem of overlapping variable clustering,\nubiquitous in data science. We define the population level clusters as groups\nof those components of X that are associated, via the sparse matrix A, with the\nsame unobservable latent factor, and multi-factor association is allowed.\nClusters are respectively anchored by the pure variables, and form overlapping\nsub-groups of the p-dimensional random vector X. The Latent model approach to\nOVErlapping clustering is reflected in the name of our algorithm, LOVE.\n","authors":"Xin Bing|Florentina Bunea|Yang Ning|Marten Wegkamp","affiliations":"","link_abstract":"http://arxiv.org/abs/1704.06977v4","link_pdf":"http://arxiv.org/pdf/1704.06977v4","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ME","categories":"stat.ME|math.ST|stat.ML|stat.TH"} {"id":"1704.07506v1","submitted":"2017-04-25 01:20:40","updated":"2017-04-25 01:20:40","title":"Some Like it Hoax: Automated Fake News Detection in Social Networks","abstract":" In recent years, the reliability of information on the Internet has emerged\nas a crucial issue of modern society. Social network sites (SNSs) have\nrevolutionized the way in which information is spread by allowing users to\nfreely share content. As a consequence, SNSs are also increasingly used as\nvectors for the diffusion of misinformation and hoaxes. 
The amount of\ndisseminated information and the rapidity of its diffusion make it practically\nimpossible to assess reliability in a timely manner, highlighting the need for\nautomatic hoax detection systems.\n As a contribution towards this objective, we show that Facebook posts can be\nclassified with high accuracy as hoaxes or non-hoaxes on the basis of the users\nwho \"liked\" them. We present two classification techniques, one based on\nlogistic regression, the other on a novel adaptation of boolean crowdsourcing\nalgorithms. On a dataset consisting of 15,500 Facebook posts and 909,236 users,\nwe obtain classification accuracies exceeding 99% even when the training set\ncontains less than 1% of the posts. We further show that our techniques are\nrobust: they work even when we restrict our attention to the users who like\nboth hoax and non-hoax posts. These results suggest that mapping the diffusion\npattern of information can be a useful component of automatic hoax detection\nsystems.\n","authors":"Eugenio Tacchini|Gabriele Ballarin|Marco L. Della Vedova|Stefano Moret|Luca de Alfaro","affiliations":"","link_abstract":"http://arxiv.org/abs/1704.07506v1","link_pdf":"http://arxiv.org/pdf/1704.07506v1","link_doi":"","comment":"","journal_ref":"Proceedings of the Second Workshop on Data Science for Social Good\n (SoGood), Skopje, Macedonia, 2017. CEUR Workshop Proceedings Volume 1960,\n 2017","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.HC|cs.SI"} {"id":"1704.08848v1","submitted":"2017-04-28 08:47:29","updated":"2017-04-28 08:47:29","title":"Data science skills for referees: I Biological X-ray crystallography","abstract":" Since there is now a growing wish by referees to judge the underpinning data\nfor a submitted article it is timely to provide a summary of the data\nevaluation checks required to be done by a referee. As these checks will vary\nfrom field to field this article focuses on the needs of biological X-ray\ncrystallography articles, which is the predominantly used method leading to\ndepositions in the PDB. The expected referee checks of data underpinning an\narticle are described with examples. These checks necessarily include that a\nreferee checks the PDB validation report for each crystal structure\naccompanying the article submission; this check whilst necessary is not\nsufficient for a complete evaluation. A referee would be expected to undertake\none cycle of model refinement of the authors' biological macromolecule\ncoordinates against the authors' processed diffraction data and look at the\nvarious validation checks of the model and Fo-Fc electron density maps in e.g.\nPhenix_refine and in COOT. If the referee deems it necessary, the diffraction data\nimages should be reprocessed (e.g. to a different diffraction resolution than\nthe authors' submission). This can be requested to be done by the authors or if\nthe referee prefers can be undertaken directly by the referee themselves. A\nreferee wishing to do these data checks may wish to receive a certificate that\nthey have command of these data science skills. The organisation of such\nvoluntary certification training can e.g. 
be via those crystallography\nassociations duly recognised by the IUCr to issue such certificates.\n","authors":"John R Helliwell","affiliations":"","link_abstract":"http://arxiv.org/abs/1704.08848v1","link_pdf":"http://arxiv.org/pdf/1704.08848v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"q-bio.BM","categories":"q-bio.BM"} {"id":"1705.00070v1","submitted":"2017-04-28 20:41:17","updated":"2017-04-28 20:41:17","title":"Enabling Interactive Analytics of Secure Data using Cloud Kotta","abstract":" Research, especially in the social sciences and humanities, is increasingly\nreliant on the application of data science methods to analyze large amounts of\n(often private) data. Secure data enclaves provide a solution for managing and\nanalyzing private data. However, such enclaves do not readily support discovery\nscience---a form of exploratory or interactive analysis by which researchers\nexecute a range of (sometimes large) analyses in an iterative and collaborative\nmanner. The batch computing model offered by many data enclaves is well suited\nto executing large compute tasks; however it is far from ideal for day-to-day\ndiscovery science. As researchers must submit jobs to queues and wait for\nresults, the high latencies inherent in queue-based, batch computing systems\nhinder interactive analysis. In this paper we describe how we have augmented\nthe Cloud Kotta secure data enclave to support collaborative and interactive\nanalysis of sensitive data. Our model uses Jupyter notebooks as a flexible\nanalysis environment and Python language constructs to support the execution of\narbitrary functions on private data within this secure framework.\n","authors":"Yadu N. Babuji|Kyle Chard|Eamon Duede","affiliations":"","link_abstract":"http://arxiv.org/abs/1705.00070v1","link_pdf":"http://arxiv.org/pdf/1705.00070v1","link_doi":"","comment":"To appear in Proceedings of Workshop on Scientific Cloud Computing,\n Washington, DC USA, June 2017 (ScienceCloud 2017), 7 pages","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC"} {"id":"1705.00894v1","submitted":"2017-05-02 10:35:12","updated":"2017-05-02 10:35:12","title":"Talking Open Data","abstract":" Enticing users into exploring Open Data remains an important challenge for\nthe whole Open Data paradigm. Standard stock interfaces often used by Open Data\nportals are anything but inspiring even for tech-savvy users, let alone those\nwithout an articulated interest in data science. To address a broader range of\ncitizens, we designed an open data search interface supporting natural language\ninteractions via popular platforms like Facebook and Skype. 
Our data-aware\nchatbot answers search requests and suggests relevant open datasets, bringing\na fun factor and a potential for viral dissemination into Open Data exploration.\nThe current system prototype is available for Facebook\n(https://m.me/OpenDataAssistant) and Skype\n(https://join.skype.com/bot/6db830ca-b365-44c4-9f4d-d423f728e741) users.\n","authors":"Sebastian Neumaier|Vadim Savenkov|Svitlana Vakulenko","affiliations":"","link_abstract":"http://arxiv.org/abs/1705.00894v1","link_pdf":"http://arxiv.org/pdf/1705.00894v1","link_doi":"","comment":"Accepted at ESWC2017 demo track","journal_ref":"","doi":"","primary_category":"cs.IR","categories":"cs.IR"} {"id":"1705.03451v2","submitted":"2017-05-09 17:55:15","updated":"2017-05-26 13:17:17","title":"Proceedings of the Workshop on Data Mining for Oil and Gas","abstract":" The process of exploring and exploiting Oil and Gas (O&G) generates a lot of\ndata that can bring more efficiency to the industry. The opportunities for\nusing data mining techniques in the \"digital oil-field\" remain largely\nunexplored or uncharted. With the high rate of data expansion, companies are\nscrambling to develop near-real-time predictive analytics, data\nmining and machine learning capabilities, and are expanding their data storage\ninfrastructure and resources. With these new goals come the challenges of\nmanaging data growth, integrating intelligence tools, and analyzing the data to\nglean useful insights. Oil and Gas companies need data solutions to\neconomically extract value from very large volumes of a wide variety of data\ngenerated from exploration, well drilling and production devices and sensors.\n Data mining for the oil and gas industry throughout the lifecycle of the\nreservoir includes the following roles: locating hydrocarbons, managing\ngeological data, drilling and formation evaluation, well construction, well\ncompletion, and optimizing production through the life of the oil field. For\neach of these phases during the lifecycle of the oil field, data mining plays a\nsignificant role. Based on which phase we are talking about, knowledge creation\nthrough scientific models, data analytics and machine learning, an effective,\nproductive, and on-demand data insight is critical for decision making within\nthe organization.\n The significant challenges posed by this complex and economically vital field\njustify a meeting of data scientists that are willing to share their experience\nand knowledge. Thus, the Workshop on Data Mining for Oil and Gas (DM4OG) aims\nto provide a quality forum for researchers who work on the significant\nchallenges arising from the synergy between data science, machine learning, and\nthe modeling and optimization problems in the O&G industry.\n","authors":"Alipio Jorge|German Larrazabal|Pablo Guillen|Rui L. Lopes","affiliations":"","link_abstract":"http://arxiv.org/abs/1705.03451v2","link_pdf":"http://arxiv.org/pdf/1705.03451v2","link_doi":"http://dx.doi.org/10.13140/RG.2.2.16408.39681","comment":"","journal_ref":"","doi":"10.13140/RG.2.2.16408.39681","primary_category":"cs.AI","categories":"cs.AI|stat.ML"} {"id":"1705.03566v2","submitted":"2017-05-09 23:31:15","updated":"2017-07-12 23:19:02","title":"Spatial Random Sampling: A Structure-Preserving Data Sketching Tool","abstract":" Random column sampling is not guaranteed to yield data sketches that preserve\nthe underlying structures of the data and may not sample sufficiently from\nless-populated data clusters. 
Also, adaptive sampling can often provide\naccurate low rank approximations, yet may fall short of producing descriptive\ndata sketches, especially when the cluster centers are linearly dependent.\nMotivated by that, this paper introduces a novel randomized column sampling\ntool dubbed Spatial Random Sampling (SRS), in which data points are sampled\nbased on their proximity to randomly sampled points on the unit sphere. The\nmost compelling feature of SRS is that the corresponding probability of\nsampling from a given data cluster is proportional to the surface area the\ncluster occupies on the unit sphere, independently from the size of the cluster\npopulation. Although it is fully randomized, SRS is shown to provide\ndescriptive and balanced data representations. The proposed idea addresses a\npressing need in data science and holds potential to inspire many novel\napproaches for analysis of big data.\n","authors":"Mostafa Rahmani|George Atia","affiliations":"","link_abstract":"http://arxiv.org/abs/1705.03566v2","link_pdf":"http://arxiv.org/pdf/1705.03566v2","link_doi":"http://dx.doi.org/10.1109/LSP.2017.2723472","comment":"","journal_ref":"","doi":"10.1109/LSP.2017.2723472","primary_category":"cs.LG","categories":"cs.LG|stat.ME|stat.ML"} {"id":"1705.03666v1","submitted":"2017-05-10 08:55:55","updated":"2017-05-10 08:55:55","title":"Hybrid PDE solver for data-driven problems and modern branching","abstract":" The numerical solution of large-scale PDEs, such as those occurring in\ndata-driven applications, unavoidably require powerful parallel computers and\ntailored parallel algorithms to make the best possible use of them. In fact,\nconsiderations about the parallelization and scalability of realistic problems\nare often critical enough to warrant acknowledgement in the modelling phase.\nThe purpose of this paper is to spread awareness of the Probabilistic Domain\nDecomposition (PDD) method, a fresh approach to the parallelization of PDEs\nwith excellent scalability properties. The idea exploits the stochastic\nrepresentation of the PDE and its approximation via Monte Carlo in combination\nwith deterministic high-performance PDE solvers. We describe the ingredients of\nPDD and its applicability in the scope of data science. In particular, we\nhighlight recent advances in stochastic representations for nonlinear PDEs\nusing branching diffusions, which have significantly broadened the scope of\nPDD.\n We envision this work as a dictionary giving large-scale PDE practitioners\nreferences on the very latest algorithms and techniques of a non-standard, yet\nhighly parallelizable, methodology at the interface of deterministic and\nprobabilistic numerical methods. We close this work with an invitation to the\nfully nonlinear case and open research questions.\n","authors":"Francisco Bernal|Gonçalo dos Reis|Greig Smith","affiliations":"","link_abstract":"http://arxiv.org/abs/1705.03666v1","link_pdf":"http://arxiv.org/pdf/1705.03666v1","link_doi":"","comment":"23 pages, 7 figures; Final SMUR version; To appear in the European\n Journal of Applied Mathematics (EJAM)","journal_ref":"","doi":"","primary_category":"math.NA","categories":"math.NA|math.PR|q-fin.CP|Primary 65C05, 65C30, Secondary: 65N55, 60H35, 91-XX, 35CXX"} {"id":"1705.07364v3","submitted":"2017-05-20 22:27:19","updated":"2018-02-08 21:57:54","title":"Stabilizing Adversarial Nets With Prediction Methods","abstract":" Adversarial neural networks solve many important problems in data science,\nbut are notoriously difficult to train. 
These difficulties come from the fact\nthat optimal weights for adversarial nets correspond to saddle points, and not\nminimizers, of the loss function. The alternating stochastic gradient methods\ntypically used for such problems do not reliably converge to saddle points, and\nwhen convergence does happen it is often highly sensitive to learning rates. We\npropose a simple modification of stochastic gradient descent that stabilizes\nadversarial networks. We show, both in theory and practice, that the proposed\nmethod reliably converges to saddle points, and is stable with a wider range of\ntraining parameters than a non-prediction method. This makes adversarial\nnetworks less likely to \"collapse,\" and enables faster training with larger\nlearning rates.\n","authors":"Abhay Yadav|Sohil Shah|Zheng Xu|David Jacobs|Tom Goldstein","affiliations":"","link_abstract":"http://arxiv.org/abs/1705.07364v3","link_pdf":"http://arxiv.org/pdf/1705.07364v3","link_doi":"","comment":"Accepted at ICLR 2018","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.CV|cs.NA"} {"id":"1705.07474v2","submitted":"2017-05-21 16:49:36","updated":"2018-05-29 18:23:30","title":"Why are Big Data Matrices Approximately Low Rank?","abstract":" Matrices of (approximate) low rank are pervasive in data science, appearing\nin recommender systems, movie preferences, topic models, medical records, and\ngenomics. While there is a vast literature on how to exploit low rank structure\nin these datasets, there is less attention on explaining why the low rank\nstructure appears in the first place. Here, we explain the effectiveness of low\nrank models in data science by considering a simple generative model for these\nmatrices: we suppose that each row or column is associated to a (possibly high\ndimensional) bounded latent variable, and entries of the matrix are generated\nby applying a piecewise analytic function to these latent variables. These\nmatrices are in general full rank. However, we show that we can approximate\nevery entry of an $m \\times n$ matrix drawn from this model to within a fixed\nabsolute error by a low rank matrix whose rank grows as $\\mathcal O(\\log(m +\nn))$. Hence any sufficiently large matrix from such a latent variable model can\nbe approximated, up to a small entrywise error, by a low rank matrix.\n","authors":"Madeleine Udell|Alex Townsend","affiliations":"","link_abstract":"http://arxiv.org/abs/1705.07474v2","link_pdf":"http://arxiv.org/pdf/1705.07474v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1705.07747v1","submitted":"2017-05-22 13:53:13","updated":"2017-05-22 13:53:13","title":"What does it all mean? Capturing Semantics of Surgical Data and\n Algorithms with Ontologies","abstract":" Every year approximately 234 million major surgeries are performed, leading\nto plentiful, highly diverse data. This is accompanied by a matching number of\nnovel algorithms for the surgical domain. To garner all benefits of surgical\ndata science it is necessary to have an unambiguous, shared understanding of\nalgorithms and data. This includes inputs and outputs of algorithms and thus\ntheir function, but also the semantic content, i.e. meaning of data such as\npatient parameters. We therefore propose the establishment of a new ontology\nfor data and algorithms in surgical data science. 
Such an ontology can be used\nto provide common data sets for the community, encouraging sharing of knowledge\nand comparison of algorithms on common data. We hold that this is a necessary\nfoundation towards new methods for applications such as semantic-based content\nretrieval and similarity measures and that it is overall vital for the future\nof surgical data science.\n","authors":"Darko Katić|Maria Maleshkova|Sandy Engelhardt|Ivo Wolf|Keno März|Lena Maier-Hein|Marco Nolden|Martin Wagner|Hannes Kenngott|Beat Peter Müller-Stich|Rüdiger Dillmann|Stefanie Speidel","affiliations":"","link_abstract":"http://arxiv.org/abs/1705.07747v1","link_pdf":"http://arxiv.org/pdf/1705.07747v1","link_doi":"","comment":"4 pages, 1 figure, Surgical Data Science Workshop, Heidelberg, June\n 20th, 2016","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1705.08197v1","submitted":"2017-05-23 11:53:02","updated":"2017-05-23 11:53:02","title":"Learning to Succeed while Teaching to Fail: Privacy in Closed Machine\n Learning Systems","abstract":" Security, privacy, and fairness have become critical in the era of data\nscience and machine learning. More and more we see that achieving universally\nsecure, private, and fair systems is practically impossible. We have seen for\nexample how generative adversarial networks can be used to learn about the\nexpected private training data; how the exploitation of additional data can\nreveal private information in the original one; and how what looks like\nunrelated features can teach us about each other. Confronted with this\nchallenge, in this paper we open a new line of research, where the security,\nprivacy, and fairness is learned and used in a closed environment. The goal is\nto ensure that a given entity (e.g., the company or the government), trusted to\ninfer certain information with our data, is blocked from inferring protected\ninformation from it. For example, a hospital might be allowed to produce\ndiagnosis on the patient (the positive task), without being able to infer the\ngender of the subject (negative task). Similarly, a company can guarantee that\ninternally it is not using the provided data for any undesired task, an\nimportant goal that is not contradicting the virtually impossible challenge of\nblocking everybody from the undesired task. We design a system that learns to\nsucceed on the positive task while simultaneously fail at the negative one, and\nillustrate this with challenging cases where the positive task is actually\nharder than the negative one being blocked. Fairness, to the information in the\nnegative task, is often automatically obtained as a result of this proposed\napproach. The particular framework and examples open the door to security,\nprivacy, and fairness in very important closed scenarios, ranging from private\ndata accumulation companies like social networks to law-enforcement and\nhospitals.\n","authors":"Jure Sokolic|Qiang Qiu|Miguel R. D. Rodrigues|Guillermo Sapiro","affiliations":"","link_abstract":"http://arxiv.org/abs/1705.08197v1","link_pdf":"http://arxiv.org/pdf/1705.08197v1","link_doi":"","comment":"14 pages, 1 figure","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1705.09435v1","submitted":"2017-05-26 05:36:29","updated":"2017-05-26 05:36:29","title":"Deep Learning for Lung Cancer Detection: Tackling the Kaggle Data\n Science Bowl 2017 Challenge","abstract":" We present a deep learning framework for computer-aided lung cancer\ndiagnosis. 
Our multi-stage framework detects nodules in 3D lung CAT scans,\ndetermines if each nodule is malignant, and finally assigns a cancer\nprobability based on these results. We discuss the challenges and advantages of\nour framework. In the Kaggle Data Science Bowl 2017, our framework ranked 41st\nout of 1972 teams.\n","authors":"Kingsley Kuan|Mathieu Ravaut|Gaurav Manek|Huiling Chen|Jie Lin|Babar Nazir|Cen Chen|Tse Chiang Howe|Zeng Zeng|Vijay Chandrasekhar","affiliations":"","link_abstract":"http://arxiv.org/abs/1705.09435v1","link_pdf":"http://arxiv.org/pdf/1705.09435v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1706.00327v1","submitted":"2017-06-01 14:44:34","updated":"2017-06-01 14:44:34","title":"One button machine for automating feature engineering in relational\n databases","abstract":" Feature engineering is one of the most important and time consuming tasks in\npredictive analytics projects. It involves understanding domain knowledge and\ndata exploration to discover relevant hand-crafted features from raw data. In\nthis paper, we introduce a system called One Button Machine, or OneBM for\nshort, which automates feature discovery in relational databases. OneBM\nautomatically performs a key activity of data scientists, namely, joining of\ndatabase tables and applying advanced data transformations to extract useful\nfeatures from data. We validated OneBM in Kaggle competitions in which OneBM\nachieved performance as good as the top 16% to 24% of data scientists in three Kaggle\ncompetitions. More importantly, OneBM outperformed the state-of-the-art system\nin a Kaggle competition in terms of prediction accuracy and ranking on the Kaggle\nleaderboard. The results show that OneBM can be useful for both data scientists\nand non-experts. It helps data scientists reduce data exploration time, allowing\nthem to try out many ideas in a short time. On the other hand, it enables\nnon-experts, who are not familiar with data science, to quickly extract value\nfrom their data with little effort, time, and cost.\n","authors":"Hoang Thanh Lam|Johann-Michael Thiebaut|Mathieu Sinn|Bei Chen|Tiep Mai|Oznur Alkan","affiliations":"","link_abstract":"http://arxiv.org/abs/1706.00327v1","link_pdf":"http://arxiv.org/pdf/1706.00327v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB|cs.AI"} {"id":"1706.01214v1","submitted":"2017-06-05 06:53:30","updated":"2017-06-05 06:53:30","title":"Inconsistent Node Flattening for Improving Top-down Hierarchical\n Classification","abstract":" Large-scale classification of data where classes are structurally organized\nin a hierarchy is an important area of research. Top-down approaches that\nexploit the hierarchy during the learning and prediction phase are efficient\nfor large scale hierarchical classification. However, the accuracy of top-down\napproaches is poor due to error propagation, i.e., prediction errors made at\nhigher levels in the hierarchy cannot be corrected at lower levels. One of the\nmain reasons behind errors at the higher levels is the presence of inconsistent\nnodes that are introduced due to the arbitrary process of creating these\nhierarchies by domain experts. In this paper, we propose two different\ndata-driven approaches (local and global) for hierarchical structure\nmodification that identifies and flattens inconsistent nodes present within the\nhierarchy. 
Our extensive empirical evaluation of the proposed approaches on\nseveral image and text datasets with varying distribution of features, classes\nand training instances per class shows improved classification performance over\ncompeting hierarchical modification approaches. Specifically, we see an\nimprovement of up to 7% in Macro-F1 score with our approach over the best TD baseline.\nSOURCE CODE: http://www.cs.gmu.edu/~mlbio/InconsistentNodeFlattening\n","authors":"Azad Naik|Huzefa Rangwala","affiliations":"","link_abstract":"http://arxiv.org/abs/1706.01214v1","link_pdf":"http://arxiv.org/pdf/1706.01214v1","link_doi":"","comment":"IEEE International Conference on Data Science and Advanced Analytics\n (DSAA), 2016","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1706.02046v1","submitted":"2017-06-07 04:56:25","updated":"2017-06-07 04:56:25","title":"Conditional independence test for categorical data using Poisson\n log-linear model","abstract":" We demonstrate how to test for conditional independence of two variables with\ncategorical data using Poisson log-linear models. The size of the conditioning\nset of variables can vary from 0 (simple independence) up to many variables. We\nalso provide a function in R for performing the test. Instead of calculating\nall possible tables with a for loop, we perform the test using the log-linear\nmodels, thus speeding up the process. Time comparison simulation studies are\npresented.\n","authors":"Michail Tsagris","affiliations":"","link_abstract":"http://arxiv.org/abs/1706.02046v1","link_pdf":"http://arxiv.org/pdf/1706.02046v1","link_doi":"","comment":"11 pages and 1 Figure","journal_ref":"Journal of Data Science, 2017, Volume 15(2): 347-356","doi":"","primary_category":"stat.ME","categories":"stat.ME"} {"id":"1706.02447v1","submitted":"2017-06-08 03:38:27","updated":"2017-06-08 03:38:27","title":"Luck is Hard to Beat: The Difficulty of Sports Prediction","abstract":" Predicting the outcome of sports events is a hard task. We quantify this\ndifficulty with a coefficient that measures the distance between the observed\nfinal results of sports leagues and idealized perfectly balanced competitions\nin terms of skill. This indicates the relative presence of luck and skill. We\ncollected and analyzed all games from 198 sports leagues comprising 1503\nseasons from 84 countries of 4 different sports: basketball, soccer, volleyball\nand handball. We measured the competitiveness by countries and sports. We also\nidentify in each season which teams, if removed from their league, result in a\ncompletely random tournament. Surprisingly, not many of them are needed. As\nanother contribution of this paper, we propose a probabilistic graphical model\nto learn about the teams' skills and to decompose the relative weights of luck\nand skill in each game. We break down the skill component into factors\nassociated with the teams' characteristics. The model also allows us to estimate\nas 0.36 the probability that an underdog team wins in the NBA league, with a\nhome advantage adding 0.09 to this probability. 
As shown in the first part of\nthe paper, luck is substantially present even in the most competitive\nchampionships, which partially explains why sophisticated and complex\nfeature-based models hardly beat simple models in the task of forecasting\nsports' outcomes.\n","authors":"Raquel YS Aoki|Renato M Assuncao|Pedro OS Vaz de Melo","affiliations":"","link_abstract":"http://arxiv.org/abs/1706.02447v1","link_pdf":"http://arxiv.org/pdf/1706.02447v1","link_doi":"http://dx.doi.org/10.1145/3097983.3098045","comment":"10 pages, KDD2017, Applied Data Science track","journal_ref":"Proceedings of the 23rd ACM SIGKDD International Conference on\n Knowledge Discovery and Data Mining, 2017","doi":"10.1145/3097983.3098045","primary_category":"cs.LG","categories":"cs.LG|stat.AP"} {"id":"1706.03102v1","submitted":"2017-06-09 19:45:28","updated":"2017-06-09 19:45:28","title":"Big Data, Data Science, and Civil Rights","abstract":" Advances in data analytics bring with them civil rights implications.\nData-driven and algorithmic decision making increasingly determine how\nbusinesses target advertisements to consumers, how police departments monitor\nindividuals or groups, how banks decide who gets a loan and who does not, how\nemployers hire, how colleges and universities make admissions and financial aid\ndecisions, and much more. As data-driven decisions increasingly affect every\ncorner of our lives, there is an urgent need to ensure they do not become\ninstruments of discrimination, barriers to equality, threats to social justice,\nand sources of unfairness. In this paper, we argue for a concrete research\nagenda aimed at addressing these concerns, comprising five areas of emphasis:\n(i) Determining if models and modeling procedures exhibit objectionable bias;\n(ii) Building awareness of fairness into machine learning methods; (iii)\nImproving the transparency and control of data- and model-driven decision\nmaking; (iv) Looking beyond the algorithm(s) for sources of bias and\nunfairness-in the myriad human decisions made during the problem formulation\nand modeling process; and (v) Supporting the cross-disciplinary scholarship\nnecessary to do all of that well.\n","authors":"Solon Barocas|Elizabeth Bradley|Vasant Honavar|Foster Provost","affiliations":"","link_abstract":"http://arxiv.org/abs/1706.03102v1","link_pdf":"http://arxiv.org/pdf/1706.03102v1","link_doi":"","comment":"A Computing Community Consortium (CCC) white paper, 8 pages","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1706.05858v2","submitted":"2017-06-19 10:06:01","updated":"2018-03-21 05:50:12","title":"Feature analysis of multidisciplinary scientific collaboration patterns\n based on PNAS","abstract":" The features of collaboration patterns are often considered to be different\nfrom discipline to discipline. Meanwhile, collaborating among disciplines is an\nobvious feature emerged in modern scientific research, which incubates several\ninterdisciplines. The features of collaborations in and among the disciplines\nof biological, physical and social sciences are analyzed based on 52,803 papers\npublished in a multidisciplinary journal PNAS during 1999 to 2013. From those\ndata, we found similar transitivity and assortativity of collaboration patterns\nas well as the identical distribution type of collaborators per author and that\nof papers per author, namely a mixture of generalized Poisson and power-law\ndistributions. 
In addition, we found that interdisciplinary research is\nundertaken by a considerable fraction of authors, not just those with many\ncollaborators or those with many papers. This case study provides a window for\nunderstanding aspects of multidisciplinary and interdisciplinary collaboration\npatterns.\n","authors":"Zheng Xie|Miao Li|Jianping Li|Xiaojun Duan|Zhenzheng Ouyang","affiliations":"","link_abstract":"http://arxiv.org/abs/1706.05858v2","link_pdf":"http://arxiv.org/pdf/1706.05858v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-018-0134-z","comment":"","journal_ref":"Xie, Z., Li, M., Li, J., Duan, X., & Ouyang, Z. (2018). Feature\n analysis of multidisciplinary scientific collaboration patterns based on\n PNAS. EPJ Data Science, 7(1), 5","doi":"10.1140/epjds/s13688-018-0134-z","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.DL"} {"id":"1706.07002v2","submitted":"2017-06-21 16:49:39","updated":"2018-10-19 12:38:24","title":"Uncertainty-Aware Organ Classification for Surgical Data Science\n Applications in Laparoscopy","abstract":" Objective: Surgical data science is evolving into a research field that aims\nto observe everything occurring within and around the treatment process to\nprovide situation-aware data-driven assistance. In the context of endoscopic\nvideo analysis, the accurate classification of organs in the field of view of\nthe camera proffers a technical challenge. Herein, we propose a new approach to\nanatomical structure classification and image tagging that features an\nintrinsic measure of confidence to estimate its own performance with high\nreliability and which can be applied to both RGB and multispectral imaging (MI)\ndata. Methods: Organ recognition is performed using a superpixel classification\nstrategy based on textural and reflectance information. Classification\nconfidence is estimated by analyzing the dispersion of class probabilities.\nAssessment of the proposed technology is performed through a comprehensive in\nvivo study with seven pigs. Results: When applied to image tagging, mean\naccuracy in our experiments increased from 65% (RGB) and 80% (MI) to 90% (RGB)\nand 96% (MI) with the confidence measure. Conclusion: Results showed that the\nconfidence measure had a significant influence on the classification accuracy,\nand MI data are better suited for anatomical structure labeling than RGB data.\nSignificance: This work significantly enhances the state of art in automatic\nlabeling of endoscopic videos by introducing the use of the confidence metric,\nand by being the first study to use MI data for in vivo laparoscopic tissue\nclassification. The data of our experiments will be released as the first in\nvivo MI dataset upon publication of this paper.\n","authors":"S. Moccia|S. J. Wirkert|H. Kenngott|A. S. Vemuri|M. Apitz|B. Mayer|E. De Momi|L. S. Mattos|L. Maier-Hein","affiliations":"","link_abstract":"http://arxiv.org/abs/1706.07002v2","link_pdf":"http://arxiv.org/pdf/1706.07002v2","link_doi":"http://dx.doi.org/10.1109/TBME.2018.2813015","comment":"7 pages, 6 images, 2 tables","journal_ref":"","doi":"10.1109/TBME.2018.2813015","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1706.07450v2","submitted":"2017-06-22 18:18:58","updated":"2018-08-30 20:27:20","title":"Revised Note on Learning Algorithms for Quadratic Assignment with Graph\n Neural Networks","abstract":" Inverse problems correspond to a certain type of optimization problems\nformulated over appropriate input distributions. 
Recently, there has been a\ngrowing interest in understanding the computational hardness of these\noptimization problems, not only in the worst case, but in an average-complexity\nsense under this same input distribution.\n In this revised note, we are interested in studying another aspect of\nhardness, related to the ability to learn how to solve a problem by simply\nobserving a collection of previously solved instances. These 'planted\nsolutions' are used to supervise the training of an appropriate predictive\nmodel that parametrizes a broad class of algorithms, with the hope that the\nresulting model will provide good accuracy-complexity tradeoffs in the average\nsense.\n We illustrate this setup on the Quadratic Assignment Problem, a fundamental\nproblem in Network Science. We observe that data-driven models based on Graph\nNeural Networks offer intriguingly good performance, even in regimes where\nstandard relaxation based techniques appear to suffer.\n","authors":"Alex Nowak|Soledad Villar|Afonso S. Bandeira|Joan Bruna","affiliations":"","link_abstract":"http://arxiv.org/abs/1706.07450v2","link_pdf":"http://arxiv.org/pdf/1706.07450v2","link_doi":"","comment":"Revised note to arXiv:1706.07450v1 that appeared in IEEE Data Science\n Workshop 2018","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1706.08126v2","submitted":"2017-06-25 15:41:25","updated":"2017-07-04 19:53:12","title":"ToolNet: Holistically-Nested Real-Time Segmentation of Robotic Surgical\n Tools","abstract":" Real-time tool segmentation from endoscopic videos is an essential part of\nmany computer-assisted robotic surgical systems and of critical importance in\nrobotic surgical data science. We propose two novel deep learning architectures\nfor automatic segmentation of non-rigid surgical instruments. Both methods take\nadvantage of automated deep-learning-based multi-scale feature extraction while\ntrying to maintain an accurate segmentation quality at all resolutions. The two\nproposed methods encode the multi-scale constraint inside the network\narchitecture. The first proposed architecture enforces it by cascaded\naggregation of predictions and the second proposed network does it by means of\na holistically-nested architecture where the loss at each scale is taken into\naccount for the optimization process. As the proposed methods are for real-time\nsemantic labeling, both present a reduced number of parameters. We propose the\nuse of parametric rectified linear units for semantic labeling in these small\narchitectures to increase the regularization ability of the design and maintain\nthe segmentation accuracy without overfitting the training sets. We compare the\nproposed architectures against state-of-the-art fully convolutional networks.\nWe validate our methods using existing benchmark datasets, including ex vivo\ncases with phantom tissue and different robotic surgical instruments present in\nthe scene. Our results show a statistically significant improved Dice\nSimilarity Coefficient over previous instrument segmentation methods. We\nanalyze our design choices and discuss the key drivers for improving accuracy.\n","authors":"Luis C. 
Garcia-Peraza-Herrera|Wenqi Li|Lucas Fidon|Caspar Gruijthuijsen|Alain Devreker|George Attilakos|Jan Deprest|Emmanuel Vander Poorten|Danail Stoyanov|Tom Vercauteren|Sebastien Ourselin","affiliations":"","link_abstract":"http://arxiv.org/abs/1706.08126v2","link_pdf":"http://arxiv.org/pdf/1706.08126v2","link_doi":"http://dx.doi.org/10.1109/IROS.2017.8206462","comment":"Paper accepted at IROS 2017","journal_ref":"","doi":"10.1109/IROS.2017.8206462","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1706.09308v2","submitted":"2017-06-28 14:16:56","updated":"2017-07-14 14:38:55","title":"A New Urban Objects Detection Framework Using Weakly Annotated Sets","abstract":" Urban informatics explore data science methods to address different urban\nissues intensively based on data. The large variety and quantity of data\navailable should be explored but this brings important challenges. For\ninstance, although there are powerful computer vision methods that may be\nexplored, they may require large annotated datasets. In this work we propose a\nnovel approach to automatically creating an object recognition system with\nminimal manual annotation. The basic idea behind the method is to use large\ninput datasets using available online cameras in large cities. An off-the-shelf\nweak classifier is used to detect an initial set of urban elements of interest\n(e.g. cars, pedestrians, bikes, etc.). Such an initial dataset undergoes a quality\ncontrol procedure and it is subsequently used to fine-tune a strong classifier.\nQuality control and comparative performance assessment are used as part of the\npipeline. We evaluate the method for detecting cars based on monitoring\ncameras. Experimental results using real data show that despite losing\ngenerality, the final detector provides better detection rates tailored to the\nselected cameras. The programmed robot gathered 770 video hours from 24 online\ncity cameras (\\~300GB), which has been fed to the proposed system. Our approach\nhas shown that the method nearly doubled the recall (93\\%) with respect to\nstate-of-the-art methods using off-the-shelf algorithms.\n","authors":"Eric Keiji|Gabriel Ferreira|Claudio Silva|Roberto M. Cesar Jr","affiliations":"","link_abstract":"http://arxiv.org/abs/1706.09308v2","link_pdf":"http://arxiv.org/pdf/1706.09308v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1706.09699v2","submitted":"2017-06-29 12:00:33","updated":"2017-10-28 21:58:35","title":"Nonnegative Factorization of a Data Matrix as a Motivational Example for\n Basic Linear Algebra","abstract":" We present a motivating example for matrix multiplication based on factoring\na data matrix. Traditionally, matrix multiplication is motivated by\napplications in physics: composing rigid transformations, scaling, shearing,\netc. We present an engaging modern example which naturally motivates a variety\nof matrix manipulations, and a variety of different ways of viewing matrix\nmultiplication. We exhibit a low-rank non-negative decomposition (NMF) of a\n\"data matrix\" whose entries are word frequencies across a corpus of documents.\nWe then explore the meaning of the entries in the decomposition, find natural\ninterpretations of intermediate quantities that arise in several different ways\nof writing the matrix product, and show the utility of various matrix\noperations. 
This example gives the students a glimpse of the power of an\nadvanced linear algebraic technique used in modern data science.\n","authors":"Barak A. Pearlmutter|Helena Šmigoc","affiliations":"","link_abstract":"http://arxiv.org/abs/1706.09699v2","link_pdf":"http://arxiv.org/pdf/1706.09699v2","link_doi":"http://dx.doi.org/10.1007/978-3-319-66811-6_15","comment":"","journal_ref":"","doi":"10.1007/978-3-319-66811-6_15","primary_category":"math.HO","categories":"math.HO|97M10"} {"id":"1707.00883v1","submitted":"2017-07-04 09:55:18","updated":"2017-07-04 09:55:18","title":"Space-Time Analysis of Movements in Basketball using Sensor Data","abstract":" Global Positioning Systems (GPS) are nowadays intensively used in Sport\nScience as they permit to capture the space-time trajectories of players, with\nthe aim to infer useful information to coaches in addition to traditional\nstatistics. In our application to basketball, we used Cluster Analysis in order\nto split the match in a number of separate time-periods, each identifying\nhomogeneous spatial relations among players in the court. Results allowed us to\nidentify differences in spacing among players, distinguish defensive or\noffensive actions, analyze transition probabilities from a certain group to\nanother one.\n","authors":"Rodolfo Metulini|Marica Manisera|Paola Zuccolotto","affiliations":"","link_abstract":"http://arxiv.org/abs/1707.00883v1","link_pdf":"http://arxiv.org/pdf/1707.00883v1","link_doi":"","comment":"7 pages, 3 figures, proceedings of SIS17: Statistics and Data\n Science: New Challenges, New Generations","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP"} {"id":"1707.00943v2","submitted":"2017-07-04 12:42:46","updated":"2019-06-03 14:36:09","title":"The sample complexity of multi-reference alignment","abstract":" The growing role of data-driven approaches to scientific discovery has\nunveiled a large class of models that involve latent transformations with a\nrigid algebraic constraint. Three-dimensional molecule reconstruction in\nCryo-Electron Microscopy (cryo-EM) is a central problem in this class. Despite\ndecades of algorithmic and software development, there is still little\ntheoretical understanding of the sample complexity of this problem, that is,\nnumber of images required for 3-D reconstruction. Here we consider\nmulti-reference alignment (MRA), a simple model that captures fundamental\naspects of the statistical and algorithmic challenges arising in cryo-EM and\nrelated problems. In MRA, an unknown signal is subject to two types of\ncorruption: a latent cyclic shift and the more traditional additive white\nnoise. The goal is to recover the signal at a certain precision from\nindependent samples. While at high signal-to-noise ratio (SNR), the number of\nobservations needed to recover a generic signal is proportional to\n$1/\\mathrm{SNR}$, we prove that it rises to a surprising $1/\\mathrm{SNR}^3$ in\nthe low SNR regime. This precise phenomenon was observed empirically more than\ntwenty years ago for cryo-EM but has remained unexplained to date. Furthermore,\nour techniques can easily be extended to the heterogeneous MRA model where the\nsamples come from a mixture of signals, as is often the case in applications\nsuch as cryo-EM, where molecules may have different conformations. This\nprovides a first step towards a statistical theory for heterogeneous cryo-EM.\n","authors":"Amelia Perry|Jonathan Weed|Afonso S. 
Bandeira|Philippe Rigollet|Amit Singer","affiliations":"","link_abstract":"http://arxiv.org/abs/1707.00943v2","link_pdf":"http://arxiv.org/pdf/1707.00943v2","link_doi":"","comment":"To appear in SIAM Journal on Mathematics of Data Science","journal_ref":"","doi":"","primary_category":"cs.IT","categories":"cs.IT|cs.DS|math.IT|math.ST|stat.TH|62B10, 92C55"} {"id":"1707.01469v1","submitted":"2017-07-05 17:05:54","updated":"2017-07-05 17:05:54","title":"Synthesis of Data Completion Scripts using Finite Tree Automata","abstract":" In application domains that store data in a tabular format, a common task is\nto fill the values of some cells using values stored in other cells. For\ninstance, such data completion tasks arise in the context of missing value\nimputation in data science and derived data computation in spreadsheets and\nrelational databases. Unfortunately, end-users and data scientists typically\nstruggle with many data completion tasks that require non-trivial programming\nexpertise. This paper presents a synthesis technique for automating data\ncompletion tasks using programming-by-example (PBE) and a very lightweight\nsketching approach. Given a formula sketch (e.g., AVG($?_1$, $?_2$)) and a few\ninput-output examples for each hole, our technique synthesizes a program to\nautomate the desired data completion task. Towards this goal, we propose a\ndomain-specific language (DSL) that combines spatial and relational reasoning\nover tabular data and a novel synthesis algorithm that can generate DSL\nprograms that are consistent with the input-output examples. The key technical\nnovelty of our approach is a new version space learning algorithm that is based\non finite tree automata (FTA). The use of FTAs in the learning algorithm leads\nto a more compact representation that allows more sharing between programs that\nare consistent with the examples. We have implemented the proposed approach in\na tool called DACE and evaluate it on 84 benchmarks taken from online help\nforums. We also illustrate the advantages of our approach by comparing our\ntechnique against two existing synthesizers, namely PROSE and SKETCH.\n","authors":"Xinyu Wang|Isil Dillig|Rishabh Singh","affiliations":"","link_abstract":"http://arxiv.org/abs/1707.01469v1","link_pdf":"http://arxiv.org/pdf/1707.01469v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.PL","categories":"cs.PL"} {"id":"1707.01591v1","submitted":"2017-07-05 22:12:14","updated":"2017-07-05 22:12:14","title":"A Data Science Approach to Understanding Residential Water Contamination\n in Flint","abstract":" When the residents of Flint learned that lead had contaminated their water\nsystem, the local government made water-testing kits available to them free of\ncharge. The city government published the results of these tests, creating a\nvaluable dataset that is key to understanding the causes and extent of the lead\ncontamination event in Flint. This is the nation's largest dataset on lead in a\nmunicipal water system.\n In this paper, we predict the lead contamination for each household's water\nsupply, and we study several related aspects of Flint's water troubles, many of\nwhich generalize well beyond this one city. For example, we show that elevated\nlead risks can be (weakly) predicted from observable home attributes. Then we\nexplore the factors associated with elevated lead. These risk assessments were\ndeveloped in part via a crowd sourced prediction challenge at the University of\nMichigan. 
To inform Flint residents of these assessments, they have been\nincorporated into a web and mobile application funded by \\texttt{Google.org}.\nWe also explore questions of self-selection in the residential testing program,\nexamining which factors are linked to when and how frequently residents\nvoluntarily sample their water.\n","authors":"Alex Chojnacki|Chengyu Dai|Arya Farahi|Guangsha Shi|Jared Webb|Daniel T. Zhang|Jacob Abernethy|Eric Schwartz","affiliations":"","link_abstract":"http://arxiv.org/abs/1707.01591v1","link_pdf":"http://arxiv.org/pdf/1707.01591v1","link_doi":"http://dx.doi.org/10.1145/3097983.3098078","comment":"Applied Data Science track paper at KDD 2017. For associated\n promotional video, see https://www.youtube.com/watch?v=0g66ImaV8Ag","journal_ref":"","doi":"10.1145/3097983.3098078","primary_category":"cs.LG","categories":"cs.LG|stat.AP|stat.ML"} {"id":"1707.04295v1","submitted":"2017-07-13 20:08:24","updated":"2017-07-13 20:08:24","title":"Approximation Schemes for Clustering with Outliers","abstract":" Clustering problems are well-studied in a variety of fields such as data\nscience, operations research, and computer science. Such problems include\nvariants of centre location problems, $k$-median, and $k$-means to name a few.\nIn some cases, not all data points need to be clustered; some may be discarded\nfor various reasons.\n We study clustering problems with outliers. More specifically, we look at\nUncapacitated Facility Location (UFL), $k$-Median, and $k$-Means. In UFL with\noutliers, we have to open some centres, discard up to $z$ points of $\\cal X$\nand assign every other point to the nearest open centre, minimizing the total\nassignment cost plus centre opening costs. In $k$-Median and $k$-Means, we have\nto open up to $k$ centres but there are no opening costs. In $k$-Means, the\ncost of assigning $j$ to $i$ is $\\delta^2(j,i)$. We present several results.\nOur main focus is on cases where $\\delta$ is a doubling metric or is the\nshortest path metrics of graphs from a minor-closed family of graphs. For\nuniform-cost UFL with outliers on such metrics we show that a multiswap simple\nlocal search heuristic yields a PTAS. With a bit more work, we extend this to\nbicriteria approximations for the $k$-Median and $k$-Means problems in the same\nmetrics where, for any constant $\\epsilon > 0$, we can find a solution using\n$(1+\\epsilon)k$ centres whose cost is at most a $(1+\\epsilon)$-factor of the\noptimum and uses at most $z$ outliers. We also show that natural local search\nheuristics that do not violate the number of clusters and outliers for\n$k$-Median (or $k$-Means) will have unbounded gap even in Euclidean metrics.\nFurthermore, we show how our analysis can be extended to general metrics for\n$k$-Means with outliers to obtain a $(25+\\epsilon,1+\\epsilon)$ bicriteria.\n","authors":"Zachary Friggstad|Kamyar Khodamoradi|Mohsen Rezapour|Mohammad R. 
Salavatipour","affiliations":"","link_abstract":"http://arxiv.org/abs/1707.04295v1","link_pdf":"http://arxiv.org/pdf/1707.04295v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DS","categories":"cs.DS"} {"id":"1707.05015v1","submitted":"2017-07-17 06:55:43","updated":"2017-07-17 06:55:43","title":"Iris: A Conversational Agent for Complex Tasks","abstract":" Today's conversational agents are restricted to simple standalone commands.\nIn this paper, we present Iris, an agent that draws on human conversational\nstrategies to combine commands, allowing it to perform more complex tasks that\nit has not been explicitly designed to support: for example, composing one\ncommand to \"plot a histogram\" with another to first \"log-transform the data\".\nTo enable this complexity, we introduce a domain specific language that\ntransforms commands into automata that Iris can compose, sequence, and execute\ndynamically by interacting with a user through natural language, as well as a\nconversational type system that manages what kinds of commands can be combined.\nWe have designed Iris to help users with data science tasks, a domain that\nrequires support for command combination. In evaluation, we find that data\nscientists complete a predictive modeling task significantly faster (2.6 times\nspeedup) with Iris than a modern non-conversational programming environment.\nIris supports the same kinds of commands as today's agents, but empowers users\nto weave together these commands to accomplish complex goals.\n","authors":"Ethan Fast|Binbin Chen|Julia Mendelsohn|Jonathan Bassen|Michael Bernstein","affiliations":"","link_abstract":"http://arxiv.org/abs/1707.05015v1","link_pdf":"http://arxiv.org/pdf/1707.05015v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.HC","categories":"cs.HC|cs.CL"} {"id":"1707.06071v1","submitted":"2017-07-19 13:15:00","updated":"2017-07-19 13:15:00","title":"Malware distributions and graph structure of the Web","abstract":" Knowledge about the graph structure of the Web is important for understanding\nthis complex socio-technical system and for devising proper policies supporting\nits future development. Knowledge about the differences between clean and\nmalicious parts of the Web is important for understanding potential treats to\nits users and for devising protection mechanisms. In this study, we conduct\ndata science methods on a large crawl of surface and deep Web pages with the\naim to increase such knowledge. To accomplish this, we answer the following\nquestions. Which theoretical distributions explain important local\ncharacteristics and network properties of websites? How are these\ncharacteristics and properties different between clean and malicious\n(malware-affected) websites? What is the prediction power of local\ncharacteristics and network properties to classify malware websites? To the\nbest of our knowledge, this is the first large-scale study describing the\ndifferences in global properties between malicious and clean parts of the Web.\nIn other words, our work is building on and bridging the gap between\n\\textit{Web science} that tackles large-scale graph representations and\n\\textit{Web cyber security} that is concerned with malicious activities on the\nWeb. 
The results presented herein can also help antivirus vendors in devising\napproaches to improve their detection algorithms.\n","authors":"Sanja Šćepanović|Igor Mishkovski|Jukka Ruohonen|Frederick Ayala-Gómez|Tuomas Aura|Sami Hyrynsalmi","affiliations":"","link_abstract":"http://arxiv.org/abs/1707.06071v1","link_pdf":"http://arxiv.org/pdf/1707.06071v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI"} {"id":"1707.07029v1","submitted":"2017-07-21 19:34:44","updated":"2017-07-21 19:34:44","title":"Data, Science and Society","abstract":" Reflections on the Concept of Data and its Implications for Science and\nSociety\n","authors":"Claudio Gutierrez","affiliations":"","link_abstract":"http://arxiv.org/abs/1707.07029v1","link_pdf":"http://arxiv.org/pdf/1707.07029v1","link_doi":"","comment":"Notes for a talk at LEARN Final Conference, Economic Commission for\n Latin America and the Caribbean (ECLAC), Senate House, University of London,\n London, May 5th., 2017","journal_ref":"","doi":"","primary_category":"cs.DL","categories":"cs.DL"} {"id":"1707.07799v1","submitted":"2017-07-25 03:05:39","updated":"2017-07-25 03:05:39","title":"Block Approximation of Tall Sparse Matrices and Block-Givens Rotations","abstract":" Estimation of top singular values is one of the widely used techniques and\none of the intensively researched problems in Numerical Linear Algebra and Data\nScience. We consider here two general questions related to this problem:\n How top singular values are affected by zeroing out a sparse rectangular\nblock of a matrix?\n How much top singular values differ from top column norms of a tall sparse\nnon-negative matrix ?\n","authors":"Alexander Kushkuley","affiliations":"","link_abstract":"http://arxiv.org/abs/1707.07799v1","link_pdf":"http://arxiv.org/pdf/1707.07799v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.NA","categories":"math.NA|15A18 (Primary), 15B52, 65F50 (Secondary)"} {"id":"1707.08632v2","submitted":"2017-07-26 20:32:56","updated":"2018-06-18 15:19:21","title":"Complex delay dynamics on railway networks: from universal laws to\n realistic modelling","abstract":" Railways are a key infrastructure for any modern country. The reliability and\nresilience of this peculiar transportation system may be challenged by\ndifferent shocks such as disruptions, strikes and adverse weather conditions.\nThese events compromise the correct functioning of the system and trigger the\nspreading of delays into the railway network on a daily basis. Despite their\nimportance, a general theoretical understanding of the underlying causes of\nthese disruptions is still lacking. In this work, we analyse the Italian and\nGerman railway networks by leveraging on the train schedules and actual delay\ndata retrieved during the year 2015. We use {these} data to infer simple\nstatistical laws ruling the emergence of localized delays in different areas of\nthe network and we model the spreading of these delays throughout the network\nby exploiting a framework inspired by epidemic spreading models. Our model\noffers a fast and easy tool for the preliminary assessment of the\n{effectiveness of} traffic handling policies, and of the railway {network}\ncriticalities.\n","authors":"Bernardo Monechi|Pietro Gravino|Riccardo di Clemente|Vito D. P. 
Servedio","affiliations":"","link_abstract":"http://arxiv.org/abs/1707.08632v2","link_pdf":"http://arxiv.org/pdf/1707.08632v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-018-0160-x","comment":"32 pages (with appendix), 28 Figures (with appendix), 2 Tables","journal_ref":"EPJ Data Science 2018 7:35","doi":"10.1140/epjds/s13688-018-0160-x","primary_category":"physics.soc-ph","categories":"physics.soc-ph"} {"id":"1708.04664v1","submitted":"2017-08-05 13:15:36","updated":"2017-08-05 13:15:36","title":"A Novel data Pre-processing method for multi-dimensional and non-uniform\n data","abstract":" We are in the era of data analytics and data science which is on full bloom.\nThere is abundance of all kinds of data for example biometrics based data,\nsatellite images data, chip-seq data, social network data, sensor based data\netc. from a variety of sources. This data abundance is the result of the fact\nthat storage cost is getting cheaper day by day, so people as well as almost\nall business or scientific organizations are storing more and more data. Most\nof the real data is multi-dimensional, non-uniform, and big in size, such that\nit requires a unique pre-processing before analyzing it. In order to make data\nuseful for any kind of analysis, pre-processing is a very important step. This\npaper presents a unique and novel pre-processing method for multi-dimensional\nand non-uniform data with the aim of making it uniform and reduced in size\nwithout losing much of its value. We have chosen biometric signature data to\ndemonstrate the proposed method as it qualifies for the attributes of being\nmulti-dimensional, non-uniform and big in size. Biometric signature data does\nnot only captures the structural characteristics of a signature but also its\nbehavioral characteristics that are captured using a dynamic signature capture\ndevice. These features like pen pressure, pen tilt angle, time taken to sign a\ndocument when collected in real-time turn out to be of varying dimensions. This\nfeature data set along with the structural data needs to be pre-processed in\norder to use it to train a machine learning based model for signature\nverification purposes. We demonstrate the success of the proposed method over\nother methods using experimental results for biometric signature data but the\nsame can be implemented for any other data with similar properties from a\ndifferent domain.\n","authors":"Farhana Javed Zareen|Suraiya Jabin","affiliations":"","link_abstract":"http://arxiv.org/abs/1708.04664v1","link_pdf":"http://arxiv.org/pdf/1708.04664v1","link_doi":"","comment":"11 pages, 4 Figures, 7 Tables","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1708.01944v1","submitted":"2017-08-06 22:20:02","updated":"2017-08-06 22:20:02","title":"Rookie: A unique approach for exploring news archives","abstract":" News archives are an invaluable primary source for placing current events in\nhistorical context. But current search engine tools do a poor job at uncovering\nbroad themes and narratives across documents. We present Rookie: a practical\nsoftware system which uses natural language processing (NLP) to help readers,\nreporters and editors uncover broad stories in news archives. Unlike prior\nwork, Rookie's design emerged from 18 months of iterative development in\nconsultation with editors and computational journalists. This process lead to a\ndramatically different approach from previous academic systems with similar\ngoals. 
Our efforts offer a generalizable case study for others building\nreal-world journalism software using NLP.\n","authors":"Abram Handler|Brendan O'Connor","affiliations":"","link_abstract":"http://arxiv.org/abs/1708.01944v1","link_pdf":"http://arxiv.org/pdf/1708.01944v1","link_doi":"","comment":"Presented at KDD 2017: Data Science + Journalism workshop","journal_ref":"","doi":"","primary_category":"cs.HC","categories":"cs.HC|cs.CL"} {"id":"1708.02273v1","submitted":"2017-08-07 19:10:50","updated":"2017-08-07 19:10:50","title":"An Approach with Toric Varieties for Singular Learning Machines","abstract":" The Computational Algebraic Geometry applied in Algebraic Statistics; are\nbeginning to exploring new branches and applications; in artificial\nintelligence and others areas. Currently, the development of the mathematics is\nvery extensive and it is difficult to see the immediate application of few\ntheorems in different areas, such as is the case of the Theorem 3.9 given in\n[10] and proved in part of here. Also this work has the intention to show the\nHilbert basis as a powerful tool in data science; and for that reason we\ncompile important results proved in works by, S. Watanabe [27], D. Cox, J.\nLittle and H. Schenck [8], B. Sturmfels [16] and G. Ewald [10]. In this work we\nstudy, first, the fundamental concepts in Toric Algebraic Geometry. The\nprincipal contribution of this work is the application of Hilbert basis (as one\nrealization of Theorem 3.9) for the resolution of singularities with toric\nvarieties, and a background in Lattice Polytope. In the second part we apply\nthis theorem to problems in statistical learning, principally in a recent area\nas is the Singular Learning Theory. We define the singular machines and the\nproblem of Singular Learning through the computing of learning curves on these\nstatistical machines. We review and compile results on the work of S. Watanabe\nin Singular Learning Theory, ref.; [17], [20], [21], also revising the\nimportant result in [26], about almost the machines are singular, we formalize\nthis theory withtoric resolution morphism in a theorem proved here (Theorem\n5.4), characterizing these Learning Machines as toric varieties, and we\nreproduce results previously published in Singular Statistical Learning seen in\n[19], [20], [23].\n","authors":"M. P. Castillo-Villalba|J. O. González-Cervantes","affiliations":"","link_abstract":"http://arxiv.org/abs/1708.02273v1","link_pdf":"http://arxiv.org/pdf/1708.02273v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.AG","categories":"math.AG|14M25, 52B20|I.5.1"} {"id":"1708.03116v1","submitted":"2017-08-10 08:23:40","updated":"2017-08-10 08:23:40","title":"From Random Walks to Random Leaps: Generalizing Classic Markov Chains\n for Big Data Applications","abstract":" Simple random walks are a basic staple of the foundation of probability\ntheory and form the building block of many useful and complex stochastic\nprocesses. In this paper we study a natural generalization of the random walk\nto a process in which the allowed step sizes take values in the set\n$\\{\\pm1,\\pm2,\\ldots,\\pm k\\}$, a process we call a random leap. The need to\nanalyze such models arises naturally in modern-day data science and so-called\n\"big data\" applications. 
We provide closed-form expressions for quantities\nassociated with first passage times and absorption events of random leaps.\nThese expressions are formulated in terms of the roots of the characteristic\npolynomial of a certain recurrence relation associated with the transition\nprobabilities. Our analysis shows that the expressions for absorption\nprobabilities for the classical simple random walk are a special case of a\nuniversal result that is very elegant. We also consider an important variant of\na random leap: the reflecting random leap. We demonstrate that the reflecting\nrandom leap exhibits more interesting behavior in regard to the existence of a\nstationary distribution and properties thereof. Questions relating to\nrecurrence/transience are also addressed, as well as an application of the\nrandom leap.\n","authors":"Bala Rajaratnam|Narut Sereewattanawoot|Doug Sparks|Meng-Hsuan Wu","affiliations":"","link_abstract":"http://arxiv.org/abs/1708.03116v1","link_pdf":"http://arxiv.org/pdf/1708.03116v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.PR","categories":"math.PR|60J10 (Primary), 65Q30 (Secondary)"} {"id":"1708.04098v1","submitted":"2017-08-14 12:39:44","updated":"2017-08-14 12:39:44","title":"Statistics Educational Challenge in the 21st Century","abstract":" What do we teach and what should we teach? An honest answer to this question\nis painful, very painful--what we teach lags decades behind what we practice.\nHow can we reduce this `gap' to prepare a data science workforce of trained\nnext-generation statisticians? This is a challenging open problem that requires\nmany well-thought-out experiments before finding the secret sauce. My goal in\nthis article is to lay out some basic principles and guidelines (rather than\ncreating a pseudo-curriculum based on cherry-picked topics) to expedite this\nprocess for finding an `objective' solution.\n","authors":"Subhadeep Mukhopadhyay","affiliations":"","link_abstract":"http://arxiv.org/abs/1708.04098v1","link_pdf":"http://arxiv.org/pdf/1708.04098v1","link_doi":"","comment":"Invited Opinion Article","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT"} {"id":"1708.04699v1","submitted":"2017-08-15 21:51:47","updated":"2017-08-15 21:51:47","title":"Mechanism Redesign","abstract":" This paper develops the theory of mechanism redesign by which an auctioneer\ncan reoptimize an auction based on bid data collected from previous iterations\nof the auction on bidders from the same market. We give a direct method for\nestimation of the revenue of a counterfactual auction from the bids in the\ncurrent auction. The estimator is a simple weighted order statistic of the bids\nand has the optimal error rate. Two applications of our estimator are A/B\ntesting (a.k.a., randomized controlled trials) and instrumented optimization\n(i.e., revenue optimization subject to being able to do accurate inference of\nany counterfactual auction revenue).\n","authors":"Shuchi Chawla|Jason D. 
Hartline|Denis Nekipelov","affiliations":"","link_abstract":"http://arxiv.org/abs/1708.04699v1","link_pdf":"http://arxiv.org/pdf/1708.04699v1","link_doi":"","comment":"This paper combines and improves upon results from manuscripts\n \"Mechanism Design for Data Science\" (arXiv:1404.5971) and \"A/B Testing of\n Auctions\" (arXiv:1606.00908)","journal_ref":"","doi":"","primary_category":"cs.GT","categories":"cs.GT"} {"id":"1708.04789v1","submitted":"2017-08-16 06:53:05","updated":"2017-08-16 06:53:05","title":"revisit: a Workflow Tool for Data Science","abstract":" In recent years there has been widespread concern in the scientific community\nover a reproducibility crisis. Among the major causes that have been identified\nis statistical: In many scientific research the statistical analysis (including\ndata preparation) suffers from a lack of transparency and methodological\nproblems, major obstructions to reproducibility. The revisit package aims\ntoward remedying this problem, by generating a \"software paper trail\" of the\nstatistical operations applied to a dataset. This record can be \"replayed\" for\nverification purposes, as well as be modified to enable alternative analyses.\nThe software also issues warnings of certain kinds of potential errors in\nstatistical methodology, again related to the reproducibility issue.\n","authors":"Norman Matloff|Reed Davis|Laurel Beckett|Paul Thompson","affiliations":"","link_abstract":"http://arxiv.org/abs/1708.04789v1","link_pdf":"http://arxiv.org/pdf/1708.04789v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP|cs.CY"} {"id":"1708.05279v2","submitted":"2017-08-17 13:59:48","updated":"2017-08-30 15:41:17","title":"Designing and building the mlpack open-source machine learning library","abstract":" mlpack is an open-source C++ machine learning library with an emphasis on\nspeed and flexibility. Since its original inception in 2007, it has grown to be\na large project implementing a wide variety of machine learning algorithms,\nfrom standard techniques such as decision trees and logistic regression to\nmodern techniques such as deep neural networks as well as other\nrecently-published cutting-edge techniques not found in any other library.\nmlpack is quite fast, with benchmarks showing mlpack outperforming other\nlibraries' implementations of the same methods. mlpack has an active community,\nwith contributors from around the world---including some from PUST. This short\npaper describes the goals and design of mlpack, discusses how the open-source\ncommunity functions, and shows an example usage of mlpack for a simple data\nscience problem.\n","authors":"Ryan R. Curtin|Marcus Edel","affiliations":"","link_abstract":"http://arxiv.org/abs/1708.05279v2","link_pdf":"http://arxiv.org/pdf/1708.05279v2","link_doi":"","comment":"submitted to ICOPUST 2017","journal_ref":"","doi":"","primary_category":"cs.MS","categories":"cs.MS|cs.LG|cs.SE"} {"id":"1708.08354v1","submitted":"2017-08-28 14:53:30","updated":"2017-08-28 14:53:30","title":"Recent implementations, applications, and extensions of the Locally\n Optimal Block Preconditioned Conjugate Gradient method (LOBPCG)","abstract":" Since introduction [A. Knyazev, Toward the optimal preconditioned\neigensolver: Locally optimal block preconditioned conjugate gradient method,\nSISC (2001) DOI:10.1137/S1064827500366124] and efficient parallel\nimplementation [A. 
Knyazev et al., Block locally optimal preconditioned\neigenvalue xolvers (BLOPEX) in HYPRE and PETSc, SISC (2007)\nDOI:10.1137/060661624], LOBPCG has been used is a wide range of applications in\nmechanics, material sciences, and data sciences. We review its recent\nimplementations and applications, as well as extensions of the local optimality\nidea beyond standard eigenvalue problems.\n","authors":"Andrew Knyazev","affiliations":"","link_abstract":"http://arxiv.org/abs/1708.08354v1","link_pdf":"http://arxiv.org/pdf/1708.08354v1","link_doi":"","comment":"4 pages. Householder Symposium on Numerical Linear Algebra, June 2017","journal_ref":"","doi":"","primary_category":"cs.NA","categories":"cs.NA|math.NA|stat.CO|65F15|G.1.3"} {"id":"1709.01233v8","submitted":"2017-09-05 04:19:38","updated":"2020-07-13 16:01:37","title":"Supervised Dimensionality Reduction for Big Data","abstract":" To solve key biomedical problems, experimentalists now routinely measure\nmillions or billions of features (dimensions) per sample, with the hope that\ndata science techniques will be able to build accurate data-driven inferences.\nBecause sample sizes are typically orders of magnitude smaller than the\ndimensionality of these data, valid inferences require finding a\nlow-dimensional representation that preserves the discriminating information\n(e.g., whether the individual suffers from a particular disease). Existing\nlinear and nonlinear dimensionality reduction methods either are not\nsupervised, scale poorly to operate in big data regimes, lack theoretical\nguarantees, or are \"black-box\" methods unsuitable for many applications. We\nintroduce \"Linear Optimal Low-rank\" projection (LOL), which extends principle\ncomponents analysis by incorporating, rather than ignoring, class labels, and\nfacilitates straightforward generalizations. We prove, and substantiate with\nboth synthetic and real data benchmarks, that LOL leads to an improved data\nrepresentation for subsequent classification, while maintaining computational\nefficiency and scalability. Using multiple brain imaging datasets consisting of\n>150 million features, and several genomics datasets with >500,000 features,\nLOL achieves achieves state-of-the-art classification accuracy, while only\nrequiring a few minutes on a standard desktop computer.\n","authors":"Joshua T. Vogelstein|Eric Bridgeford|Minh Tang|Da Zheng|Christopher Douville|Randal Burns|Mauro Maggioni","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.01233v8","link_pdf":"http://arxiv.org/pdf/1709.01233v8","link_doi":"","comment":"6 figures","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML"} {"id":"1709.01989v1","submitted":"2017-09-06 20:38:00","updated":"2017-09-06 20:38:00","title":"Artificial Intelligence and Data Science in the Automotive Industry","abstract":" Data science and machine learning are the key technologies when it comes to\nthe processes and products with automatic learning and optimization to be used\nin the automotive industry of the future. This article defines the terms \"data\nscience\" (also referred to as \"data analytics\") and \"machine learning\" and how\nthey are related. In addition, it defines the term \"optimizing analytics\" and\nillustrates the role of automatic optimization as a key technology in\ncombination with data analytics. 
It also uses examples to explain the way that\nthese technologies are currently being used in the automotive industry on the\nbasis of the major subprocesses in the automotive value chain (development,\nprocurement; logistics, production, marketing, sales and after-sales, connected\ncustomer). Since the industry is just starting to explore the broad range of\npotential uses for these technologies, visionary application examples are used\nto illustrate the revolutionary possibilities that they offer. Finally, the\narticle demonstrates how these technologies can make the automotive industry\nmore efficient and enhance its customer focus throughout all its operations and\nactivities, extending from the product and its development process to the\ncustomers and their connection to the product.\n","authors":"Martin Hofmann|Florian Neukart|Thomas Bäck","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.01989v1","link_pdf":"http://arxiv.org/pdf/1709.01989v1","link_doi":"","comment":"22 pages, 4 figures","journal_ref":"https://data-science-blog.com/blog/2017/05/06/artificial-intelligence-and-data-science-in-the-automotive-industry/","doi":"","primary_category":"cs.AI","categories":"cs.AI|cs.CY"} {"id":"1709.02129v1","submitted":"2017-09-07 08:16:03","updated":"2017-09-07 08:16:03","title":"Data science for assessing possible tax income manipulation: The case of\n Italy","abstract":" This paper explores a real-world fundamental theme under a data science\nperspective. It specifically discusses whether fraud or manipulation can be\nobserved in and from municipality income tax size distributions, through their\naggregation from citizen fiscal reports. The study case pertains to official\ndata obtained from the Italian Ministry of Economics and Finance over the\nperiod 2007-2011. All Italian (20) regions are considered. The considered data\nscience approach concretizes in the adoption of the Benford first digit law as\nquantitative tool. Marked disparities are found, - for several regions, leading\nto unexpected \"conclusions\". The most eye browsing regions are not the expected\nones according to classical imagination about Italy financial shadow matters.\n","authors":"Marcel Ausloos|Roy Cerqueti|Tariq A. Mir","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.02129v1","link_pdf":"http://arxiv.org/pdf/1709.02129v1","link_doi":"http://dx.doi.org/10.1016/j.chaos.2017.08.012","comment":"38 pages, 22 figures. To be published in Chaos, Solitons and Fractals","journal_ref":"Chaos, Solitons & Fractals 104 (2017) 238-256","doi":"10.1016/j.chaos.2017.08.012","primary_category":"q-fin.ST","categories":"q-fin.ST|physics.soc-ph|91B80, 62P20"} {"id":"1709.02510v1","submitted":"2017-09-08 02:40:35","updated":"2017-09-08 02:40:35","title":"\"Breaking\" Disasters: Predicting and Characterizing the Global News\n Value of Natural and Man-made Disasters","abstract":" Due to their often unexpected nature, natural and man-made disasters are\ndifficult to monitor and detect for journalists and disaster management\nresponse teams. Journalists are increasingly relying on signals from social\nmedia to detect such stories in their early stage of development. Twitter,\nwhich features a vast network of local news outlets, is a major source of early\nsignal for disaster detection. 
Journalists who work for global desks often\nfollow these sources via Twitter's lists, but have to comb through thousands of\nsmall-scale or low-impact stories to find events that may be globally relevant.\nThese are events that have a large scope, high impact, or potential\ngeo-political relevance. We propose a model for automatically identifying\nevents from local news sources that may break on a global scale within the next\n24 hours. The results are promising and can be used in a predictive setting to\nhelp journalists manage their sources more effectively, or in a descriptive\nmanner to analyze media coverage of disasters. Through the feature evaluation\nprocess, we also address the question: \"what makes a disaster event newsworthy\non a global scale?\" As part of our data collection process, we have created a\nlist of local sources of disaster/accident news on Twitter, which we have made\npublicly available.\n","authors":"Armineh Nourbakhsh|Quanzhi Li|Xiaomo Liu|Sameena Shah","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.02510v1","link_pdf":"http://arxiv.org/pdf/1709.02510v1","link_doi":"","comment":"Accepted by KDD 2017 Data Science + Journalism workshop","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI|physics.soc-ph"} {"id":"1710.04120v1","submitted":"2017-09-09 07:27:09","updated":"2017-09-09 07:27:09","title":"Decision Support System for Urbanization of the Northern Part of the\n Volga-Akhtuba Floodplain (Russia) on the Basis of Interdisciplinary Computer\n Modeling","abstract":" There is a computer decision support system (CDSS) for urbanization of the\nnorthern part of the Volga-Akhtuba floodplain. This system includes subsystems\nof cognitive and game-theoretic analysis, geoinformation and hydrodynamic\nsimulations. The paper presents the cognitive graph, two-level and three-level\nmodels of hierarchical games for the cases of uncontrolled and controlled\ndevelopment of the problem situation. We described the quantitative analysis of\nthe effects of different strategies for the spatial distribution of the\nurbanized territories. For this reason we conducted the territory zoning\naccording to the level of negative consequences of urbanization for various\nagents. In addition, we found an analytical solution for games with the linear\ndependence of the average flooded area on the urbanized area. We numerically\ncomputed a game equilibrium for dependences derived from the imitational\ngeoinformation and hydrodynamic modeling of flooding. As the result, we showed\nthat the transition to the three-level management system and the implementation\nof an optimal urbanization strategy minimize its negative consequences.\n","authors":"Alexander Voronin|Inessa Isaeva|Alexander Khoperskov|Sergey Grebenjuk","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.04120v1","link_pdf":"http://arxiv.org/pdf/1710.04120v1","link_doi":"http://dx.doi.org/10.1007/978-3-319-65551-2_30","comment":"14 pages, 5 figures; Conference: Creativity in Intelligent\n Technologies and Data Science. CIT&DS 2017","journal_ref":"Communications in Computer and Information Science, 2017, vol.\n 754, p. 
419-429","doi":"10.1007/978-3-319-65551-2_30","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1709.03904v4","submitted":"2017-09-12 15:16:23","updated":"2019-01-04 18:23:01","title":"A Tutorial on Statistically Sound Pattern Discovery","abstract":" Statistically sound pattern discovery harnesses the rigour of statistical\nhypothesis testing to overcome many of the issues that have hampered standard\ndata mining approaches to pattern discovery. Most importantly, application of\nappropriate statistical tests allows precise control over the risk of false\ndiscoveries -- patterns that are found in the sample data but do not hold in\nthe wider population from which the sample was drawn. Statistical tests can\nalso be applied to filter out patterns that are unlikely to be useful, removing\nuninformative variations of the key patterns in the data. This tutorial\nintroduces the key statistical and data mining theory and techniques that\nunderpin this fast developing field.\n We concentrate on two general classes of patterns: dependency rules that\nexpress statistical dependencies between condition and consequent parts and\ndependency sets that express mutual dependence between set elements. We clarify\nalternative interpretations of statistical dependence and introduce appropriate\ntests for evaluating statistical significance of patterns in different\nsituations. We also introduce special techniques for controlling the likelihood\nof spurious discoveries when multitudes of patterns are evaluated.\n The paper is aimed at a wide variety of audiences. It provides the necessary\nstatistical background and summary of the state-of-the-art for any data mining\nresearcher or practitioner wishing to enter or understand statistically sound\npattern discovery research or practice. It can serve as a general introduction\nto the field of statistically sound pattern discovery for any reader with a\ngeneral background in data sciences.\n","authors":"Wilhelmiina Hämäläinen|Geoffrey I. Webb","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.03904v4","link_pdf":"http://arxiv.org/pdf/1709.03904v4","link_doi":"http://dx.doi.org/10.1007/s10618-018-0590-x","comment":"51 pages. This is a prepublication version of an open-access journal\n paper. This version presents the original math notations that were\n compromised in the published version","journal_ref":"Data Mining and Knowledge Discovery, First Online 20 December 2018","doi":"10.1007/s10618-018-0590-x","primary_category":"stat.ME","categories":"stat.ME"} {"id":"1709.05156v1","submitted":"2017-09-15 11:26:07","updated":"2017-09-15 11:26:07","title":"Trend Detection based Regret Minimization for Bandit Problems","abstract":" We study a variation of the classical multi-armed bandits problem. In this\nproblem, the learner has to make a sequence of decisions, picking from a fixed\nset of choices. In each round, she receives as feedback only the loss incurred\nfrom the chosen action. Conventionally, this problem has been studied when\nlosses of the actions are drawn from an unknown distribution or when they are\nadversarial. In this paper, we study this problem when the losses of the\nactions also satisfy certain structural properties, and especially, do show a\ntrend structure. 
When this is true, we show that using \\textit{trend\ndetection}, we can achieve regret of order $\\tilde{O} (N \\sqrt{TK})$ with\nrespect to a switching strategy for the version of the problem where a single\naction is chosen in each round and $\\tilde{O} (Nm \\sqrt{TK})$ when $m$ actions\nare chosen each round. This guarantee is a significant improvement over the\nconventional benchmark. Our approach can, as a framework, be applied in\ncombination with various well-known bandit algorithms, like Exp3. For both\nversions of the problem, we give regret guarantees also for the\n\\textit{anytime} setting, i.e. when the length of the choice-sequence is not\nknown in advance. Finally, we pinpoint the advantages of our method by\ncomparing it to some well-known other strategies.\n","authors":"Paresh Nakhe|Rebecca Reiffenhäuser","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.05156v1","link_pdf":"http://arxiv.org/pdf/1709.05156v1","link_doi":"http://dx.doi.org/10.1109/DSAA.2016.35","comment":"","journal_ref":"2016 IEEE International Conference on Data Science and Advanced\n Analytics (DSAA), Montreal, QC, 2016, pp. 263-271","doi":"10.1109/DSAA.2016.35","primary_category":"cs.LG","categories":"cs.LG"} {"id":"1709.05551v1","submitted":"2017-09-16 18:57:37","updated":"2017-09-16 18:57:37","title":"Applying Machine Learning Methods to Enhance the Distribution of Social\n Services in Mexico","abstract":" The Government of Mexico's social development agency, SEDESOL, is responsible\nfor the administration of social services and has the mission of lifting\nMexican families out of poverty. One key challenge they face is matching people\nwho have social service needs with the services SEDESOL can provide accurately\nand efficiently. In this work we describe two specific applications implemented\nin collaboration with SEDESOL to enhance their distribution of social services.\nThe first problem relates to systematic underreporting on applications for\nsocial services, which makes it difficult to identify where to prioritize\noutreach. Responding that five people reside in a home when only three do is a\ntype of underreporting that could occur while a social worker conducts a home\nsurvey with a family to determine their eligibility for services. The second\ninvolves approximating multidimensional poverty profiles across households.\nThat is, can we characterize different types of vulnerabilities -- for example,\nfood insecurity and lack of health services -- faced by those in poverty?\n We detail the problem context, available data, our machine learning\nformulation, experimental results, and effective feature sets. As far as we are\naware this is the first time government data of this scale has been used to\ncombat poverty within Mexico. 
We found that survey data alone can suggest\npotential underreporting.\n Further, we found geographic features useful for housing and service related\nindicators and transactional data informative for other dimensions of poverty.\nThe results from our machine learning system for estimating poverty profiles\nwill directly help better match 7.4 million individuals to social programs.\n","authors":"Kris Sankaran|Diego Garcia-Olano|Mobin Javed|Maria Fernanda Alcala-Durand|Adolfo De Unánue|Paul van der Boor|Eric Potash|Roberto Sánchez Avalos|Luis Iñaki Alberro Encinas|Rayid Ghani","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.05551v1","link_pdf":"http://arxiv.org/pdf/1709.05551v1","link_doi":"","comment":"This work was done as part of the 2016 Eric & Wendy Schmidt Data\n Science for Social Good Summer Fellowship at the University of Chicago","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP"} {"id":"1709.06176v1","submitted":"2017-09-18 21:55:47","updated":"2017-09-18 21:55:47","title":"Zooming in on NYC taxi data with Portal","abstract":" In this paper we develop a methodology for analyzing transportation data at\ndifferent levels of temporal and geographic granularity, and apply our\nmethodology to the TLC Trip Record Dataset, made publicly available by the NYC\nTaxi & Limousine Commission. This data is naturally represented by a set of\ntrajectories, annotated with time and with additional information such as\npassenger count and cost. We analyze TLC data to identify hotspots, which point\nto lack of convenient public transportation options, and popular routes, which\nmotivate ride-sharing solutions or addition of a bus route.\n Our methodology is based on using a system called Portal, which implements\nefficient representations and principled analysis methods for evolving graphs.\nPortal is implemented on top of Apache Spark, a popular distributed data\nprocessing system, is inter-operable with other Spark libraries like SparkSQL,\nand supports sophisticated kinds of analysis of evolving graphs efficiently.\nPortal is currently under development in the Data, Responsibly Lab at Drexel.\nWe plan to release Portal in the open source in Fall 2017.\n","authors":"Julia Stoyanovich|Matthew Gilbride|Vera Zaychik Moffitt","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.06176v1","link_pdf":"http://arxiv.org/pdf/1709.06176v1","link_doi":"","comment":"Presented at Data Science for Social Good (DSSG) 2017:\n https://dssg.uchicago.edu/data-science-for-social-good-conference-2017/","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1709.06740v2","submitted":"2017-09-20 07:05:40","updated":"2017-10-09 05:37:37","title":"Discovery of the Twitter Bursty Botnet","abstract":" Many Twitter users are bots. They can be used for spamming, opinion\nmanipulation and online fraud. Recently we discovered the Star Wars botnet,\nconsisting of more than 350,000 bots tweeting random quotations exclusively\nfrom Star Wars novels. The bots were exposed because they tweeted uniformly\nfrom any location within two rectangle-shaped geographic zones covering Europe\nand the USA, including sea and desert areas in the zones. In this paper, we\nreport another unusual behaviour of the Star Wars bots, that the bots were\ncreated in bursts or batches, and they only tweeted in their first few minutes\nsince creation. Inspired by this observation, we discovered an even larger\nTwitter botnet, the Bursty botnet with more than 500,000 bots. 
Our preliminary\nstudy showed that the Bursty botnet was directly responsible for a large-scale\nonline spamming attack in 2012. Most bot detection algorithms have been based\non assumptions of `common' features that were supposedly shared by all bots.\nOur discovered botnets, however, do not show many of those features; instead,\nthey were detected by their distinct, unusual tweeting behaviours that were\nunknown until now.\n","authors":"Juan Echeverria|Christoph Besel|Shi Zhou","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.06740v2","link_pdf":"http://arxiv.org/pdf/1709.06740v2","link_doi":"","comment":"Accepted for publication at : Data Science for Cyber-Security 2017","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI"} {"id":"1709.07095v1","submitted":"2017-09-20 22:21:14","updated":"2017-09-20 22:21:14","title":"Practical Machine Learning for Cloud Intrusion Detection: Challenges and\n the Way Forward","abstract":" Operationalizing machine learning based security detections is extremely\nchallenging, especially in a continuously evolving cloud environment.\nConventional anomaly detection does not produce satisfactory results for\nanalysts that are investigating security incidents in the cloud. Model\nevaluation alone presents its own set of problems due to a lack of benchmark\ndatasets. When deploying these detections, we must deal with model compliance,\nlocalization, and data silo issues, among many others. We pose the problem of\n\"attack disruption\" as a way forward in the security data science space. In\nthis paper, we describe the framework, challenges, and open questions\nsurrounding the successful operationalization of machine learning based\nsecurity detections in a cloud environment and provide some insights on how we\nhave addressed them.\n","authors":"Ram Shankar Siva Kumar|Andrew Wicker|Matt Swann","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.07095v1","link_pdf":"http://arxiv.org/pdf/1709.07095v1","link_doi":"","comment":"10 pages, 9 figures","journal_ref":"","doi":"","primary_category":"cs.CR","categories":"cs.CR|cs.AI"} {"id":"1709.07493v1","submitted":"2017-09-21 18:50:32","updated":"2017-09-21 18:50:32","title":"Big Data Systems Meet Machine Learning Challenges: Towards Big Data\n Science as a Service","abstract":" Recently, we have been witnessing huge advancements in the scale of data we\nroutinely generate and collect in pretty much everything we do, as well as our\nability to exploit modern technologies to process, analyze and understand this\ndata. The intersection of these trends is what is called, nowadays, as Big Data\nScience. Cloud computing represents a practical and cost-effective solution for\nsupporting Big Data storage, processing and for sophisticated analytics\napplications. We analyze in details the building blocks of the software stack\nfor supporting big data science as a commodity service for data scientists. 
We\nprovide various insights about the latest ongoing developments and open\nchallenges in this domain.\n","authors":"Radwa Elshawi|Sherif Sakr","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.07493v1","link_pdf":"http://arxiv.org/pdf/1709.07493v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1709.07534v1","submitted":"2017-09-21 22:38:51","updated":"2017-09-21 22:38:51","title":"MRNet-Product2Vec: A Multi-task Recurrent Neural Network for Product\n Embeddings","abstract":" E-commerce websites such as Amazon, Alibaba, Flipkart, and Walmart sell\nbillions of products. Machine learning (ML) algorithms involving products are\noften used to improve the customer experience and increase revenue, e.g.,\nproduct similarity, recommendation, and price estimation. The products are\nrequired to be represented as features before training an ML algorithm. In this\npaper, we propose an approach called MRNet-Product2Vec for creating generic\nembeddings of products within an e-commerce ecosystem. We learn a dense and\nlow-dimensional embedding where a diverse set of signals related to a product\nare explicitly injected into its representation. We train a Discriminative\nMulti-task Bidirectional Recurrent Neural Network (RNN), where the input is a\nproduct title fed through a Bidirectional RNN and at the output, product labels\ncorresponding to fifteen different tasks are predicted. The task set includes\nseveral intrinsic characteristics about a product such as price, weight, size,\ncolor, popularity, and material. We evaluate the proposed embedding\nquantitatively and qualitatively. We demonstrate that they are almost as good\nas sparse and extremely high-dimensional TF-IDF representation in spite of\nhaving less than 3% of the TF-IDF dimension. We also use a multimodal\nautoencoder for comparing products from different language-regions and show\npreliminary yet promising qualitative results.\n","authors":"Arijit Biswas|Mukul Bhutani|Subhajit Sanyal","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.07534v1","link_pdf":"http://arxiv.org/pdf/1709.07534v1","link_doi":"","comment":"Published in ECML-PKDD 2017 (Applied Data Science Track)","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI|cs.LG|stat.ML"} {"id":"1709.10242v2","submitted":"2017-09-29 05:43:39","updated":"2017-10-04 16:33:07","title":"Intelligence Quotient and Intelligence Grade of Artificial Intelligence","abstract":" Although artificial intelligence is currently one of the most interesting\nareas in scientific research, the potential threats posed by emerging AI\nsystems remain a source of persistent controversy. To address the issue of AI\nthreat, this study proposes a standard intelligence model that unifies AI and\nhuman characteristics in terms of four aspects of knowledge, i.e., input,\noutput, mastery, and creation. 
Using this model, we observe three challenges,\nnamely, expanding of the von Neumann architecture; testing and ranking the\nintelligence quotient of naturally and artificially intelligent systems,\nincluding humans, Google, Bing, Baidu, and Siri; and finally, the dividing of\nartificially intelligent systems into seven grades from robots to Google Brain.\nBased on this, we conclude that AlphaGo belongs to the third grade.\n","authors":"Feng Liu|Yong Shi|Ying Liu","affiliations":"","link_abstract":"http://arxiv.org/abs/1709.10242v2","link_pdf":"http://arxiv.org/pdf/1709.10242v2","link_doi":"http://dx.doi.org/10.1007/s40745-017-0109-0","comment":"","journal_ref":"Annals of Data Science, June 2017, Volume 4, Issue 2, pp 179-191","doi":"10.1007/s40745-017-0109-0","primary_category":"cs.AI","categories":"cs.AI"} {"id":"1710.00027v1","submitted":"2017-09-29 18:43:23","updated":"2017-09-29 18:43:23","title":"Toward a System Building Agenda for Data Integration","abstract":" In this paper we argue that the data management community should devote far\nmore effort to building data integration (DI) systems, in order to truly\nadvance the field. Toward this goal, we make three contributions. First, we\ndraw on our recent industrial experience to discuss the limitations of current\nDI systems. Second, we propose an agenda to build a new kind of DI systems to\naddress these limitations. These systems guide users through the DI workflow,\nstep by step. They provide tools to address the \"pain points\" of the steps, and\ntools are built on top of the Python data science and Big Data ecosystem\n(PyData). We discuss how to foster an ecosystem of such tools within PyData,\nthen use it to build DI systems for collaborative/cloud/crowd/lay user\nsettings. Finally, we discuss ongoing work at Wisconsin, which suggests that\nthese DI systems are highly promising and building them raises many interesting\nresearch challenges.\n","authors":"AnHai Doan|Adel Ardalan|Jeffrey R. Ballard|Sanjib Das|Yash Govind|Pradap Konda|Han Li|Erik Paulson|Paul Suganthan G. C.|Haojun Zhang","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.00027v1","link_pdf":"http://arxiv.org/pdf/1710.00027v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1710.01788v1","submitted":"2017-10-04 20:13:21","updated":"2017-10-04 20:13:21","title":"Multitask Learning using Task Clustering with Applications to Predictive\n Modeling and GWAS of Plant Varieties","abstract":" Inferring predictive maps between multiple input and multiple output\nvariables or tasks has innumerable applications in data science. Multi-task\nlearning attempts to learn the maps to several output tasks simultaneously with\ninformation sharing between them. We propose a novel multi-task learning\nframework for sparse linear regression, where a full task hierarchy is\nautomatically inferred from the data, with the assumption that the task\nparameters follow a hierarchical tree structure. The leaves of the tree are the\nparameters for individual tasks, and the root is the global model that\napproximates all the tasks. We apply the proposed approach to develop and\nevaluate: (a) predictive models of plant traits using large-scale and automated\nremote sensing data, and (b) GWAS methodologies mapping such derived phenotypes\nin lieu of hand-measured traits. 
We demonstrate the superior performance of our\napproach compared to other methods, as well as the usefulness of discovering\nhierarchical groupings between tasks. Our results suggest that richer genetic\nmapping can indeed be obtained from the remote sensing data. In addition, our\ndiscovered groupings reveal interesting insights from a plant science\nperspective.\n","authors":"Ming Yu|Addie M. Thompson|Karthikeyan Natesan Ramamurthy|Eunho Yang|Aurélie C. Lozano","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.01788v1","link_pdf":"http://arxiv.org/pdf/1710.01788v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML"} {"id":"1710.01931v1","submitted":"2017-10-05 09:17:22","updated":"2017-10-05 09:17:22","title":"Forecasting Player Behavioral Data and Simulating in-Game Events","abstract":" Understanding player behavior is fundamental in game data science. Video\ngames evolve as players interact with the game, so being able to foresee player\nexperience would help to ensure a successful game development. In particular,\ngame developers need to evaluate beforehand the impact of in-game events.\nSimulation optimization of these events is crucial to increase player\nengagement and maximize monetization. We present an experimental analysis of\nseveral methods to forecast game-related variables, with two main aims: to\nobtain accurate predictions of in-app purchases and playtime in an operational\nproduction environment, and to perform simulations of in-game events in order\nto maximize sales and playtime. Our ultimate purpose is to take a step towards\nthe data-driven development of games. The results suggest that, even though the\nperformance of traditional approaches such as ARIMA is still better, the\noutcomes of state-of-the-art techniques like deep learning are promising. Deep\nlearning comes up as a well-suited general model that could be used to forecast\na variety of time series with different dynamic behaviors.\n","authors":"Anna Guitart|Pei Pei Chen|Paul Bertens|África Periáñez","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.01931v1","link_pdf":"http://arxiv.org/pdf/1710.01931v1","link_doi":"http://dx.doi.org/10.1007/978-3-030-03402-3_19","comment":"","journal_ref":"In: Arai K., Kapoor S., Bhatia R. (eds) Advances in Information\n and Communication Networks. FICC 2018. Advances in Intelligent Systems and\n Computing, vol 886. Springer, Cham","doi":"10.1007/978-3-030-03402-3_19","primary_category":"stat.ML","categories":"stat.ML"} {"id":"1710.02264v1","submitted":"2017-10-06 03:19:55","updated":"2017-10-06 03:19:55","title":"Churn Prediction in Mobile Social Games: Towards a Complete Assessment\n Using Survival Ensembles","abstract":" Reducing user attrition, i.e. churn, is a broad challenge faced by several\nindustries. In mobile social games, decreasing churn is decisive to increase\nplayer retention and rise revenues. Churn prediction models allow to understand\nplayer loyalty and to anticipate when they will stop playing a game. Thanks to\nthese predictions, several initiatives can be taken to retain those players who\nare more likely to churn.\n Survival analysis focuses on predicting the time of occurrence of a certain\nevent, churn in our case. Classical methods, like regressions, could be applied\nonly when all players have left the game. The challenge arises for datasets\nwith incomplete churning information for all players, as most of them still\nconnect to the game. 
This is called a censored data problem and is in the\nnature of churn. Censoring is commonly dealt with survival analysis techniques,\nbut due to the inflexibility of the survival statistical algorithms, the\naccuracy achieved is often poor. In contrast, novel ensemble learning\ntechniques, increasingly popular in a variety of scientific fields, provide\nhigh-class prediction results.\n In this work, we develop, for the first time in the social games domain, a\nsurvival ensemble model which provides a comprehensive analysis together with\nan accurate prediction of churn. For each player, we predict the probability of\nchurning as function of time, which permits to distinguish various levels of\nloyalty profiles. Additionally, we assess the risk factors that explain the\npredicted player survival times. Our results show that churn prediction by\nsurvival ensembles significantly improves the accuracy and robustness of\ntraditional analyses, like Cox regression.\n","authors":"África Periáñez|Alain Saas|Anna Guitart|Colin Magne","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.02264v1","link_pdf":"http://arxiv.org/pdf/1710.02264v1","link_doi":"http://dx.doi.org/10.1109/DSAA.2016.84","comment":"","journal_ref":"IEEE International Conference on Data Science and Advanced\n Analytics (DSAA), 564--573, 2016","doi":"10.1109/DSAA.2016.84","primary_category":"stat.ML","categories":"stat.ML"} {"id":"1710.02447v1","submitted":"2017-10-06 15:24:57","updated":"2017-10-06 15:24:57","title":"Data science for urban equity: Making gentrification an accessible topic\n for data scientists, policymakers, and the community","abstract":" The University of Washington eScience Institute runs an annual Data Science\nfor Social Good (DSSG) program that selects four projects each year to train\nstudents from a wide range of disciplines while helping community members\nexecute social good projects, often with an urban focus.\n We present observations and deliberations of one such project, the DSSG 2017\n'Equitable Futures' project, which investigates the ongoing gentrification\nprocess and the increasingly inequitable access to opportunities in Seattle.\nSimilar processes can be observed in many major cities. The project connects\nissues usually analyzed in the disciplines of the built environment, geography,\nsociology, economics, social work and city governments with data science\nmethodologies and visualizations.\n","authors":"Bernease Herman|Gundula Proksch|Rachel Berney|Hillary Dawkins|Jacob Kovacs|Yahui Ma|Jacob Rich|Amanda Tan","affiliations":"U. of Washington|U. of Washington|U. of Washington|U. of Washington|U. of Washington|U. of Washington|U. of Wisconsin|U. of Washington","link_abstract":"http://arxiv.org/abs/1710.02447v1","link_pdf":"http://arxiv.org/pdf/1710.02447v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2017","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1710.02453v1","submitted":"2017-10-06 15:36:37","updated":"2017-10-06 15:36:37","title":"Exploring the Urban - Rural Incarceration Divide: Drivers of Local Jail\n Incarceration Rates in the U.S","abstract":" As the rate of incarceration in the United States continues to grow, a large\nbody of research has been primarily focused on understanding the determinants\nand drivers of federal and state prison growth. However, local jail systems,\nwith 11 million admissions each year, have generated less research attention\neven though they have a far broader impact on communities. 
Preliminary time\ntrend analysis conducted by the Vera Institute of Justice (Vera) uncovered\ndisparities in county jail incarceration rates by geography. Contrary to\nassumptions that incarceration is an urban phenomenon, Vera discovered that\nduring the past few decades, pretrial jail rates have declined in many urban\nareas whereas rates have grown or remained flat in rural counties. In an effort\nto uncover the factors contributing to continued jail growth in rural areas,\nVera joined forces with Two Sigma's Data Clinic, a volunteer-based program that\nleverages employees' data science expertise. Using county jail data from 2000 -\n2013 and county-specific demographic, political, socioeconomic, jail and prison\npopulation variables, a generalized estimating equations (GEE) model was\nspecified to account for correlations within counties over time. The results\nrevealed that county-level poverty, police expenditures, and spillover effects\nfrom other county and state authorities are all significant predictors of local\njail rates. In addition, geographic investigation of model residuals revealed\nclusters of counties where observed rates were much higher (and much lower)\nthan expected conditioned upon county variables.\n","authors":"Rachael Weiss Riley|Jacob Kang-Brown|Chris Mulligan|Vinod Valsalam|Soumyo Chakraborty|Christian Henrichson","affiliations":"Two Sigma Data Clinic|Vera Institute of Justice|Two Sigma Data Clinic|Two Sigma Data Clinic|Two Sigma Data Clinic|Vera Institute of Justice","link_abstract":"http://arxiv.org/abs/1710.02453v1","link_pdf":"http://arxiv.org/pdf/1710.02453v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2017","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1710.02454v1","submitted":"2017-10-06 15:37:22","updated":"2017-10-06 15:37:22","title":"Using data science as a community advocacy tool to promote equity in\n urban renewal programs: An analysis of Atlanta's Anti-Displacement Tax Fund","abstract":" Cities across the United States are undergoing great transformation and urban\ngrowth. Data and data analysis has become an essential element of urban\nplanning as cities use data to plan land use and development. One great\nchallenge is to use the tools of data science to promote equity along with\ngrowth. The city of Atlanta is an example site of large-scale urban renewal\nthat aims to engage in development without displacement. On the Westside of\ndowntown Atlanta, the construction of the new Mercedes-Benz Stadium and the\nconversion of an underutilized rail-line into a multi-use trail may result in\nincreased property values. In response to community residents' concerns and a\ncommitment to development without displacement, the city and philanthropic\npartners announced an Anti-Displacement Tax Fund to subsidize future property\ntax increases of owner occupants for the next twenty years. To achieve greater\ntransparency, accountability, and impact, residents expressed a desire for a\ntool that would help them determine eligibility and quantify this commitment.\nIn support of this goal, we use machine learning techniques to analyze\nhistorical tax assessment and predict future tax assessments. We then apply\neligibility estimates to our predictions to estimate the total cost for the\nfirst seven years of the program. 
These forecasts are also incorporated into an\ninteractive tool for community residents to determine their eligibility for the\nfund and the expected increase in their home value over the next seven years.\n","authors":"Jeremy Auerbach|Hayley Barton|Takeria Blunt|Vishwamitra Chaganti|Bhavya Ghai|Amanda Meng|Christopher Blackburn|Ellen Zegura|Pamela Flores","affiliations":"University of Tennessee|Duke University|Spelman College|Georgia State University|Stony Brook University|Georgia Institute of Technology|Georgia Institute of Technology|Georgia Institute of Technology|HELP Organization Incorporate","link_abstract":"http://arxiv.org/abs/1710.02454v1","link_pdf":"http://arxiv.org/pdf/1710.02454v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2017","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1710.02595v2","submitted":"2017-10-06 21:42:15","updated":"2017-10-10 05:05:58","title":"Intelligent Pothole Detection and Road Condition Assessment","abstract":" Poor road conditions are a public nuisance, causing passenger discomfort,\ndamage to vehicles, and accidents. In the U.S., road-related conditions are a\nfactor in 22,000 of the 42,000 traffic fatalities each year. Although we often\ncomplain about bad roads, we have no way to detect or report them at scale. To\naddress this issue, we developed a system to detect potholes and assess road\nconditions in real-time. Our solution is a mobile application that captures\ndata on a car's movement from gyroscope and accelerometer sensors in the phone.\nTo assess roads using this sensor data, we trained SVM models to classify road\nconditions with 93% accuracy and potholes with 92% accuracy, beating the base\nrate for both problems. As the user drives, the models use the sensor data to\nclassify whether the road is good or bad, and whether it contains potholes.\nThen, the classification results are used to create data-rich maps that\nillustrate road conditions across the city. Our system will empower civic\nofficials to identify and repair damaged roads which inconvenience passengers\nand cause accidents. This paper details our data science process for collecting\ntraining data on real roads, transforming noisy sensor data into useful\nsignals, training and evaluating machine learning models, and deploying those\nmodels to production through a real-time classification app. It also highlights\nhow cities can use our system to crowdsource data and deliver road repair\nresources to areas in need.\n","authors":"Umang Bhatt|Shouvik Mani|Edgar Xi|J. Zico Kolter","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.02595v2","link_pdf":"http://arxiv.org/pdf/1710.02595v2","link_doi":"","comment":"Presented at the Data For Good Exchange 2017","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1710.02862v1","submitted":"2017-10-08 17:43:23","updated":"2017-10-08 17:43:23","title":"Exploration of Heterogeneous Data Using Robust Similarity","abstract":" Heterogeneous data pose serious challenges to data analysis tasks, including\nexploration and visualization. Current techniques often utilize dimensionality\nreductions, aggregation, or conversion to numerical values to analyze\nheterogeneous data. However, the effectiveness of such techniques to find\nsubtle structures such as the presence of multiple modes or detection of\noutliers is hindered by the challenge to find the proper subspaces or prior\nknowledge to reveal the structures. 
In this paper, we propose a generic\nsimilarity-based exploration technique that is applicable to a wide variety of\ndatatypes and their combinations, including heterogeneous ensembles. The\nproposed concept of similarity has a close connection to statistical analysis\nand can be deployed for summarization, revealing fine structures such as the\npresence of multiple modes, and detection of anomalies or outliers. We then\npropose a visual encoding framework that enables the exploration of a\nheterogeneous dataset in different levels of detail and provides insightful\ninformation about both global and local structures. We demonstrate the utility\nof the proposed technique using various real datasets, including ensemble data.\n","authors":"Mahsa Mirzargar|Ross T. Whitaker|Robert M. Kirby","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.02862v1","link_pdf":"http://arxiv.org/pdf/1710.02862v1","link_doi":"","comment":"Presented at Visualization in Data Science (VDS at IEEE VIS 2017)","journal_ref":"","doi":"","primary_category":"cs.GR","categories":"cs.GR|cs.HC"} {"id":"1710.03040v1","submitted":"2017-10-09 11:40:36","updated":"2017-10-09 11:40:36","title":"Run Time Prediction for Big Data Iterative ML Algorithms: a KMeans case\n study","abstract":" Data science and machine learning algorithms running on big data\ninfrastructure are increasingly important in activities ranging from business\nintelligence and analytics to cybersecurity, smart city management, and many\nfields of science and engineering. As these algorithms are further integrated\ninto daily operations, understanding how long they take to run on a big data\ninfrastructure is paramount to controlling costs and delivery times. In this\npaper we discuss the issues involved in understanding the run time of iterative\nmachine learning algorithms and provide a case study of such an algorithm -\nincluding a statistical characterization and model of the run time of an\nimplementation of K-Means for the Spark big data engine using the Edward\nprobabilistic programming language.\n","authors":"Eduardo Rodrigues|Ricardo Morla","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.03040v1","link_pdf":"http://arxiv.org/pdf/1710.03040v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC"} {"id":"1710.03410v1","submitted":"2017-10-10 05:59:51","updated":"2017-10-10 05:59:51","title":"A Decision Theoretic Approach to A/B Testing","abstract":" A/B testing is ubiquitous within the machine learning and data science\noperations of internet companies. Generically, the idea is to perform a\nstatistical test of the hypothesis that a new feature is better than the\nexisting platform---for example, it results in higher revenue. If the p value\nfor the test is below some pre-defined threshold---often, 0.05---the new\nfeature is implemented. The difficulty of choosing an appropriate threshold has\nbeen noted before, particularly because dependent tests are often done\nsequentially, leading some to propose control of the false discovery rate (FDR)\nrather than use of a single, universal threshold. However, it is still\nnecessary to make an arbitrary choice of the level at which to control FDR.\nHere we suggest a decision-theoretic approach to determining whether to adopt a\nnew feature, which enables automated selection of an appropriate threshold. 
Our\nmethod has the basic ingredients of any decision-theory problem: a loss\nfunction, action space, and a notion of optimality, for which we choose Bayes\nrisk. However, the loss function and the action space differ from the typical\nchoices made in the literature, which has focused on the theory of point\nestimation. We give some basic results for Bayes-optimal thresholding rules for\nthe feature adoption decision, and give some examples using eBay data. The\nresults suggest that the 0.05 p-value threshold may be too conservative in some\nsettings, but that its widespread use may reflect an ad-hoc means of\ncontrolling multiplicity in the common case of repeatedly testing variants of\nan experiment when the threshold is not reached.\n","authors":"David Goldberg|James E. Johndrow","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.03410v1","link_pdf":"http://arxiv.org/pdf/1710.03410v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.ME|stat.TH|62"} {"id":"1710.04019v1","submitted":"2017-10-11 11:53:32","updated":"2017-10-11 11:53:32","title":"An introduction to Topological Data Analysis: fundamental and practical\n aspects for data scientists","abstract":" Topological Data Analysis (tda) is a recent and fast growing field providing a\nset of new topological and geometric tools to infer relevant features for\npossibly complex data. This paper is a brief introduction, through a few\nselected topics, to basic fundamental and practical aspects of tda for non\nexperts. 1 Introduction and motivation Topological Data Analysis (tda) is a\nrecent field that emerged from various works in applied (algebraic) topology and\ncomputational geometry during the first decade of the century. Although one can\ntrace back geometric approaches for data analysis quite far in the past, tda\nreally started as a field with the pioneering works of Edelsbrunner et al. (2002)\nand Zomorodian and Carlsson (2005) in persistent homology and was popularized\nin a landmark paper in 2009 Carlsson (2009). tda is mainly motivated by the\nidea that topology and geometry provide a powerful approach to infer robust\nqualitative, and sometimes quantitative, information about the structure of\ndata; see, e.g. Chazal (2017). tda aims at providing well-founded mathematical,\nstatistical and algorithmic methods to infer, analyze and exploit the complex\ntopological and geometric structures underlying data that are often represented\nas point clouds in Euclidean or more general metric spaces. During the last few\nyears, a considerable effort has been made to provide robust and efficient data\nstructures and algorithms for tda that are now implemented and available and\neasy to use through standard libraries such as the Gudhi library (C++ and\nPython) Maria et al. (2014) and its R software interface Fasy et al. (2014a).\nAlthough it is still rapidly evolving, tda now provides a set of mature and\nefficient tools that can be used in combination or complementary to other data\nsciences tools. The tda pipeline. tda has recently known developments in various\ndirections and application fields. There now exist a large variety of methods\ninspired by topological and geometric approaches. Providing a complete overview\nof all these existing approaches is beyond the scope of this introductory\nsurvey. However, most of them rely on the following basic and standard pipeline\nthat will serve as the backbone of this paper: 1. 
The input is assumed to be a\nfinite set of points coming with a notion of distance, or similarity, between them.\nThis distance can be induced by the metric in the ambient space (e.g. the\nEuclidean metric when the data are embedded in R^d) or come as an intrinsic\nmetric defined by a pairwise distance matrix. The definition of the metric on the\ndata is usually given as an input or guided by the application. It is however\nimportant to notice that the choice of the metric may be critical to reveal\ninteresting topological and geometric features of the data.\n","authors":"Frédéric Chazal|Bertrand Michel","affiliations":"DATASHAPE|LSTA","link_abstract":"http://arxiv.org/abs/1710.04019v1","link_pdf":"http://arxiv.org/pdf/1710.04019v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|cs.LG|math.AT|stat.ML|stat.TH"} {"id":"1710.04226v1","submitted":"2017-10-11 18:00:08","updated":"2017-10-11 18:00:08","title":"Machine Learning Bell Nonlocality in Quantum Many-body Systems","abstract":" Machine learning, the core of artificial intelligence and big data science,\nis one of today's most rapidly growing interdisciplinary fields. Recently, its\ntools and techniques have been adopted to tackle intricate quantum many-body\nproblems. In this work, we introduce machine learning techniques to the\ndetection of quantum nonlocality in many-body systems, with a focus on the\nrestricted-Boltzmann-machine (RBM) architecture. Using reinforcement learning,\nwe demonstrate that RBM is capable of finding the maximum quantum violations of\nmultipartite Bell inequalities with given measurement settings. Our results\nbuild a novel bridge between computer-science-based machine learning and\nquantum many-body nonlocality, which will benefit future studies in both areas.\n","authors":"Dong-Ling Deng","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.04226v1","link_pdf":"http://arxiv.org/pdf/1710.04226v1","link_doi":"http://dx.doi.org/10.1103/PhysRevLett.120.240402","comment":"Main Text: 7 pages, 3 figures. Supplementary Material: 2 pages, 3\n figures","journal_ref":"Phys. Rev. Lett. 120, 240402 (2018)","doi":"10.1103/PhysRevLett.120.240402","primary_category":"quant-ph","categories":"quant-ph|cond-mat.dis-nn|cond-mat.quant-gas"} {"id":"1710.05654v2","submitted":"2017-10-16 12:42:15","updated":"2019-05-01 12:30:44","title":"Large Scale Graph Learning from Smooth Signals","abstract":" Graphs are a prevalent tool in data science, as they model the inherent\nstructure of the data. They have been used successfully in unsupervised and\nsemi-supervised learning. Typically they are constructed either by connecting\nnearest samples, or by learning them from data, solving an optimization\nproblem. While graph learning does achieve a better quality, it also comes with\na higher computational cost. In particular, the current state-of-the-art model\ncost is $\\mathcal{O}(n^2)$ for $n$ samples. In this paper, we show how to scale\nit, obtaining an approximation with leading cost of $\\mathcal{O}(n\\log(n))$,\nwith quality that approaches the exact graph learning model. 
Our algorithm uses\nknown approximate nearest neighbor techniques to reduce the number of\nvariables, and automatically selects the correct parameters of the model,\nrequiring a single intuitive input: the desired edge density.\n","authors":"Vassilis Kalofolias|Nathanaël Perraudin","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.05654v2","link_pdf":"http://arxiv.org/pdf/1710.05654v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1710.06291v1","submitted":"2017-10-17 14:01:55","updated":"2017-10-17 14:01:55","title":"Clear Visual Separation of Temporal Event Sequences","abstract":" Extracting and visualizing informative insights from temporal event sequences\nbecomes increasingly difficult when data volume and variety increase. Besides\ndealing with high event type cardinality and many distinct sequences, it can be\ndifficult to tell whether it is appropriate to combine multiple events into one\nor utilize additional information about event attributes. Existing approaches\noften make use of frequent sequential patterns extracted from the dataset,\nhowever, these patterns are limited in terms of interpretability and utility.\nIn addition, it is difficult to assess the role of absolute and relative time\nwhen using pattern mining techniques.\n In this paper, we present methods that addresses these challenges by\nautomatically learning composite events which enables better aggregation of\nmultiple event sequences. By leveraging event sequence outcomes, we present\nappropriate linked visualizations that allow domain experts to identify\ncritical flows, to assess validity and to understand the role of time.\nFurthermore, we explore information gain and visual complexity metrics to\nidentify the most relevant visual patterns. We compare composite event learning\nwith two approaches for extracting event patterns using real world company\nevent data from an ongoing project with the Danish Business Authority.\n","authors":"Andreas Mathisen|Kaj Grønbæk","affiliations":"Department of Computer Science, Aarhus University|Department of Computer Science, Aarhus University","link_abstract":"http://arxiv.org/abs/1710.06291v1","link_pdf":"http://arxiv.org/pdf/1710.06291v1","link_doi":"","comment":"In Proceedings of the 3rd IEEE Symposium on Visualization in Data\n Science (VDS), 2017","journal_ref":"","doi":"","primary_category":"cs.HC","categories":"cs.HC"} {"id":"1710.06552v4","submitted":"2017-10-18 01:45:07","updated":"2019-02-08 14:41:04","title":"Relaxation-Based Coarsening for Multilevel Hypergraph Partitioning","abstract":" Multilevel partitioning methods that are inspired by principles of\nmultiscaling are the most powerful practical hypergraph partitioning solvers.\nHypergraph partitioning has many applications in disciplines ranging from\nscientific computing to data science. In this paper we introduce the concept of\nalgebraic distance on hypergraphs and demonstrate its use as an algorithmic\ncomponent in the coarsening stage of multilevel hypergraph partitioning\nsolvers. The algebraic distance is a vertex distance measure that extends\nhyperedge weights for capturing the local connectivity of vertices which is\ncritical for hypergraph coarsening schemes. The practical effectiveness of the\nproposed measure and corresponding coarsening scheme is demonstrated through\nextensive computational experiments on a diverse set of problems. 
Finally, we\npropose a benchmark of hypergraph partitioning problems to compare the quality\nof other solvers.\n","authors":"Ruslan Shaydulin|Jie Chen|Ilya Safro","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.06552v4","link_pdf":"http://arxiv.org/pdf/1710.06552v4","link_doi":"http://dx.doi.org/10.1137/17M1152735","comment":"","journal_ref":"","doi":"10.1137/17M1152735","primary_category":"cs.DM","categories":"cs.DM"} {"id":"1710.06590v1","submitted":"2017-10-18 06:14:53","updated":"2017-10-18 06:14:53","title":"MEDOC: a Python wrapper to load MEDLINE into a local MySQL database","abstract":" Since the MEDLINE database was released, the number of documents indexed by\nthis entity has risen every year. Several tools have been developed by the\nNational Institutes of Health (NIH) to query this corpus of scientific\npublications. However, in terms of advances in big data, text-mining and data\nscience, an option to build a local relational database containing all metadata\navailable on MEDLINE would be truly useful to optimally exploit these\nresources. MEDOC (MEdline DOwnloading Contrivance) is a Python program designed\nto download data on an FTP and to load all extracted information into a local\nMySQL database. It took MEDOC 4 days and 17 hours to load the 26 million\ndocuments available on this server onto a standard computer. This indexed\nrelational database allows the user to build complex and rapid queries. All\nfields can thus be searched for desired information, a task that is difficult\nto accomplish through the PubMed graphical interface. MEDOC is free and\npublicly available at https://github.com/MrMimic/MEDOC.\n","authors":"Emeric Dynomant|Mathilde Gorieu|Helene Perrin|Marion Denorme|Fabien Pichon|Arnaud Desfeux","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.06590v1","link_pdf":"http://arxiv.org/pdf/1710.06590v1","link_doi":"","comment":"4 pages, 1 figure","journal_ref":"","doi":"","primary_category":"cs.DL","categories":"cs.DL|cs.DB"} {"id":"1710.06811v1","submitted":"2017-10-18 16:11:43","updated":"2017-10-18 16:11:43","title":"Visual Progression Analysis of Student Records Data","abstract":" University curriculum, both on a campus level and on a per-major level, are\naffected in a complex way by many decisions of many administrators and faculty\nover time. As universities across the United States share an urgency to\nsignificantly improve student success and success retention, there is a\npressing need to better understand how the student population is progressing\nthrough the curriculum, and how to provide better supporting infrastructure and\nrefine the curriculum for the purpose of improving student outcomes. This work\nhas developed a visual knowledge discovery system called eCamp that pulls\ntogether a variety of populationscale data products, including student grades,\nmajor descriptions, and graduation records. These datasets were previously\ndisconnected and only available to and maintained by independent campus\noffices. The framework models and analyzes the multi-level relationships hidden\nwithin these data products, and visualizes the student flow patterns through\nindividual majors as well as through a hierarchy of majors. These results\nsupport analytical tasks involving student outcomes, student retention, and\ncurriculum design. 
It is shown how eCamp has revealed student progression\ninformation that was previously unavailable.\n","authors":"Mohammad Raji|John Duggan|Blaise DeCotes|Jian Huang|Bradley Vander Zanden","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.06811v1","link_pdf":"http://arxiv.org/pdf/1710.06811v1","link_doi":"","comment":"8 pages, 7 figures, Published in Visualization in Data Science (VDS\n 2017)","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1710.06839v1","submitted":"2017-10-18 17:44:17","updated":"2017-10-18 17:44:17","title":"Driving with Data: Modeling and Forecasting Vehicle Fleet Maintenance in\n Detroit","abstract":" The City of Detroit maintains an active fleet of over 2500 vehicles, spending\nan annual average of over \\$5 million on new vehicle purchases and over \\$7.7\nmillion on maintaining this fleet. Understanding the existence of patterns and\ntrends in this data could be useful to a variety of stakeholders, particularly\nas Detroit emerges from Chapter 9 bankruptcy, but the patterns in such data are\noften complex and multivariate and the city lacks dedicated resources for\ndetailed analysis of this data. This work, a data collaboration between the\nMichigan Data Science Team (http://midas.umich.edu/mdst) and the City of\nDetroit's Operations and Infrastructure Group, seeks to address this unmet need\nby analyzing data from the City of Detroit's entire vehicle fleet from\n2010-2017. We utilize tensor decomposition techniques to discover and visualize\nunique temporal patterns in vehicle maintenance; apply differential sequence\nmining to demonstrate the existence of common and statistically unique\nmaintenance sequences by vehicle make and model; and, after showing these\ntime-dependencies in the dataset, demonstrate an application of a predictive\nLong Short Term Memory (LSTM) neural network model to predict maintenance\nsequences. Our analysis shows both the complexities of municipal vehicle fleet\ndata and useful techniques for mining and modeling such data.\n","authors":"Josh Gardner|Danai Koutra|Jawad Mroueh|Victor Pang|Arya Farahi|Sam Krassenstein|Jared Webb","affiliations":"University of Michigan|University of Michigan|University of Michigan|University of Michigan|University of Michigan|City of Detroit|City of Detroit","link_abstract":"http://arxiv.org/abs/1710.06839v1","link_pdf":"http://arxiv.org/pdf/1710.06839v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2017","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1710.06871v1","submitted":"2017-10-18 18:01:55","updated":"2017-10-18 18:01:55","title":"Promoting Saving for College Through Data Science","abstract":" The cost of attending college has been steadily rising and in 10 years is\nestimated to reach $140,000 for a 4-year public university. Recent surveys\nestimate just over half of US families are saving for college. State-operated\n529 college savings plans are an effective way for families to plan and save\nfor future college costs, but only 3% of families currently use them. The\nOffice of the Illinois State Treasurer (Treasurer) administers two 529 plans to\nhelp its residents save for college. In order to increase the number of\nfamilies saving for college, the Treasurer and Civis Analytics used data\nscience techniques to identify the people most likely to sign up for a college\nsavings plan. 
In this paper, we will discuss the use of person matching to join\naccountholder data from the Treasurer to the Civis National File, as well as\nthe use of lookalike modeling to identify new potential signups. In order to\navoid reinforcing existing demographic imbalances in who saves for college, the\nlookalike models used were ensured to be racially and economically balanced. We\nwill also discuss how these new signup targets were then individually served\ndigital ads to encourage opening college savings accounts.\n","authors":"Fernando Diaz|Natnaell Mammo","affiliations":"Office of the Illinois State Treasurer|Civis Analytics Washington","link_abstract":"http://arxiv.org/abs/1710.06871v1","link_pdf":"http://arxiv.org/pdf/1710.06871v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2017","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1710.06881v1","submitted":"2017-10-18 18:08:13","updated":"2017-10-18 18:08:13","title":"Children and the Data Cycle: Rights and Ethics in a Big Data World","abstract":" In an era of increasing dependence on data science and big data, the voices\nof one set of major stakeholders - the world's children and those who advocate\non their behalf - have been largely absent. A recent paper estimates one in\nthree global internet users is a child, yet there has been little rigorous\ndebate or understanding of how to adapt traditional, offline ethical standards\nfor research, involving data collection from children, to a big data, online\nenvironment (Livingstone et al., 2015). This paper argues that due to the\npotential for severe, long-lasting and differential impacts on children, child\nrights need to be firmly integrated onto the agendas of global debates about\nethics and data science. The authors outline their rationale for a greater\nfocus on child rights and ethics in data science and suggest steps to move\nforward, focussing on the various actors within the data chain including data\ngenerators, collectors, analysts and end users. It concludes by calling for a\nmuch stronger appreciation of the links between child rights, ethics and data\nscience disciplines and for enhanced discourse between stakeholders in the data\nchain and those responsible for upholding the rights of children globally.\n","authors":"Gabrielle Berman|Kerry Albright","affiliations":"UNICEF Office of Research|UNICEF Office of Research","link_abstract":"http://arxiv.org/abs/1710.06881v1","link_pdf":"http://arxiv.org/pdf/1710.06881v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2017","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1710.06882v1","submitted":"2017-10-18 18:10:35","updated":"2017-10-18 18:10:35","title":"Mapping for accessibility: A case study of ethics in data science for\n social good","abstract":" Ethics in the emerging world of data science are often discussed through\ncautionary tales about the dire consequences of missteps taken by high profile\ncompanies or organizations. We take a different approach by foregrounding the\nways that ethics are implicated in the day-to-day work of data science,\nfocusing on instances in which data scientists recognize, grapple with, and\nconscientiously respond to ethical challenges. This paper presents a case study\nof ethical dilemmas that arose in a \"data science for social good\" (DSSG)\nproject focused on improving navigation for people with limited mobility. 
We\ndescribe how this particular DSSG team responded to those dilemmas, and how\nthose responses gave rise to still more dilemmas. While the details of the case\ndiscussed here are unique, the ethical dilemmas they illuminate can commonly be\nfound across many DSSG projects. These include: the risk of exacerbating\ndisparities; the thorniness of algorithmic accountability; the evolving\nopportunities for mischief presented by new technologies; the subjective and\nvalue- laden interpretations at the heart of any data-intensive project; the\npotential for data to amplify or mute particular voices; the possibility of\nprivacy violations; and the folly of technological solutionism. Based on our\ntracing of the team's responses to these dilemmas, we distill lessons for an\nethical data science practice that can be more generally applied across DSSG\nprojects. Specifically, this case experience highlights the importance of: 1)\nSetting the scene early on for ethical thinking 2) Recognizing ethical\ndecision-making as an emergent phenomenon intertwined with the quotidian work\nof data science for social good 3) Approaching ethical thinking as a thoughtful\nand intentional balancing of priorities rather than a binary differentiation\nbetween right and wrong.\n","authors":"Anissa Tanweer|Nicholas Bolten|Margaret Drouhard|Jess Hamilton|Anat Caspi|Brittany Fiore-Gartland|Kaicheng Tan","affiliations":"University of Washington Seattle|University of Washington Seattle|University of Washington Seattle|University of Washington Seattle|University of Washington Seattle|University of Washington Seattle|University of Washington Seattle","link_abstract":"http://arxiv.org/abs/1710.06882v1","link_pdf":"http://arxiv.org/pdf/1710.06882v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2017","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1710.06905v1","submitted":"2017-10-18 19:31:43","updated":"2017-10-18 19:31:43","title":"Predictors of Re-admission for Homeless Families in New York City: The\n Case of the Win Shelter Network","abstract":" New York City faces the challenge of an ever-increasing homeless population\nwith almost 60,000 people currently living in city shelters. In 2015,\napproximately 25% of families stayed longer than 9 months in a shelter, and 17%\nof families with children that exited a homeless shelter returned to the\nshelter system within 30 days of leaving. This suggests that \"long-term\"\nshelter residents and those that re-enter shelters contribute significantly to\nthe rise of the homeless population living in city shelters and indicate\nsystemic challenges to finding adequate permanent housing. Women in Need (Win)\nis a non-profit agency that provides shelter to almost 10,000 homeless women\nand children (10% of all homeless families of NYC), and is the largest homeless\nshelter provider in the City. This paper focuses on our preliminary work with\nWin to understand the factors that affect the rate of readmission of homeless\nfamilies at Win shelters, and to predict the likelihood of re-entry into the\nshelter system on exit. These insights will enable improved service delivery\nand operational efficiencies at these shelters. 
This paper describes our recent\nefforts to integrate Win datasets with city records to create a unified,\ncomprehensive database of the homeless population being served by Win shelters.\nA preliminary classification model is developed to predict the odds of\nreadmission and length of shelter stay based on the demographic and\nsocioeconomic characteristics of the homeless population served by Win. This\nwork is intended to form the basis for establishing a network of \"smart\nshelters\" through the use of data science and data technologies.\n","authors":"Constantine Kontokosta|Boyeong Hong|Awais Malik|Ira M. Bellach|Xueqi Huang|Kristi Korsberg|Dara Perl|Avikal Somvanshi","affiliations":"New York University|New York University|New York University|Women in Need NYC|New York University|New York University|New York University|New York University","link_abstract":"http://arxiv.org/abs/1710.06905v1","link_pdf":"http://arxiv.org/pdf/1710.06905v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2017","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1710.07029v1","submitted":"2017-10-19 07:56:56","updated":"2017-10-19 07:56:56","title":"Visual Analysis of Spatio-Temporal Event Predictions: Investigating the\n Spread Dynamics of Invasive Species","abstract":" Invasive species are a major cause of ecological damage and commercial\nlosses. A current problem spreading in North America and Europe is the vinegar\nfly Drosophila suzukii. Unlike other Drosophila, it infests non-rotting and\nhealthy fruits and is therefore of concern to fruit growers, such as vintners.\nConsequently, large amounts of data about infestations have been collected in\nrecent years. However, there is a lack of interactive methods to investigate\nthis data. We employ ensemble-based classification to predict areas susceptible\nto infestation by D. suzukii and bring them into a spatio-temporal context\nusing maps and glyph-based visualizations. Following the information-seeking\nmantra, we provide a visual analysis system Drosophigator for spatio-temporal\nevent prediction, enabling the investigation of the spread dynamics of invasive\nspecies. We demonstrate the usefulness of this approach in two use cases.\n","authors":"Daniel Seebacher|Johannes Häußler|Michael Hundt|Manuel Stein|Hannes Müller|Ulrich Engelke|Daniel Keim","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.07029v1","link_pdf":"http://arxiv.org/pdf/1710.07029v1","link_doi":"","comment":"","journal_ref":"Symposium on Visualization in Data Science (VDS) at IEEE VIS 2017","doi":"","primary_category":"cs.HC","categories":"cs.HC"} {"id":"1710.08728v1","submitted":"2017-10-24 12:23:20","updated":"2017-10-24 12:23:20","title":"Greater data science at baccalaureate institutions","abstract":" Donoho's JCGS (in press) paper is a spirited call to action for\nstatisticians, who he points out are losing ground in the field of data science\nby refusing to accept that data science is its own domain. (Or, at least, a\ndomain that is becoming distinctly defined.) 
He calls on writings by John\nTukey, Bill Cleveland, and Leo Breiman, among others, to remind us that\nstatisticians have been dealing with data science for years, and encourages\nacceptance of the direction of the field while also ensuring that statistics is\ntightly integrated.\n As faculty at baccalaureate institutions (where the growth of undergraduate\nstatistics programs has been dramatic), we are keen to ensure statistics has a\nplace in data science and data science education. In his paper, Donoho is\nprimarily focused on graduate education. At our undergraduate institutions, we\nare considering many of the same questions.\n","authors":"Amelia McNamara|Nicholas J. Horton|Benjamin S. Baumer","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.08728v1","link_pdf":"http://arxiv.org/pdf/1710.08728v1","link_doi":"http://dx.doi.org/10.1080/10618600.2017.1386568","comment":"in press response to Donoho paper in Journal of Computational\n Graphics and Statistics","journal_ref":"","doi":"10.1080/10618600.2017.1386568","primary_category":"stat.OT","categories":"stat.OT|stat.ML"} {"id":"1710.08880v1","submitted":"2017-10-24 16:38:18","updated":"2017-10-24 16:38:18","title":"Wildbook: Crowdsourcing, computer vision, and data science for\n conservation","abstract":" Photographs, taken by field scientists, tourists, automated cameras, and\nincidental photographers, are the most abundant source of data on wildlife\ntoday. Wildbook is an autonomous computational system that starts from massive\ncollections of images and, by detecting various species of animals and\nidentifying individuals, combined with sophisticated data management, turns\nthem into high resolution information database, enabling scientific inquiry,\nconservation, and citizen science.\n We have built Wildbooks for whales (flukebook.org), sharks (whaleshark.org),\ntwo species of zebras (Grevy's and plains), and several others. In January\n2016, Wildbook enabled the first ever full species (the endangered Grevy's\nzebra) census using photographs taken by ordinary citizens in Kenya. The\nresulting numbers are now the official species census used by IUCN Red List:\nhttp://www.iucnredlist.org/details/7950/0. In 2016, Wildbook partnered up with\nWWF to build Wildbook for Sea Turtles, Internet of Turtles (IoT), as well as\nsystems for seals and lynx. Most recently, we have demonstrated that we can now\nuse publicly available social media images to count and track wild animals.\n In this paper we present and discuss both the impact and challenges that the\nuse of crowdsourced images can have on wildlife conservation.\n","authors":"Tanya Y. Berger-Wolf|Daniel I. Rubenstein|Charles V. Stewart|Jason A. 
Holmberg|Jason Parham|Sreejith Menon|Jonathan Crall|Jon Van Oast|Emre Kiciman|Lucas Joppa","affiliations":"University of Illinois at Chicago|Princeton University|Rensselaer Polytechnic Inst.|WildMe.org|Rensselaer Polytechnic Inst.|Bloomberg LP|Rensselaer Polytechnic Inst.|WildMe.org|Microsoft Research|Microsoft Research","link_abstract":"http://arxiv.org/abs/1710.08880v1","link_pdf":"http://arxiv.org/pdf/1710.08880v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2017","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1710.09549v3","submitted":"2017-10-26 05:36:35","updated":"2017-12-03 00:17:37","title":"Context-Aware Generative Adversarial Privacy","abstract":" Preserving the utility of published datasets while simultaneously providing\nprovable privacy guarantees is a well-known challenge. On the one hand,\ncontext-free privacy solutions, such as differential privacy, provide strong\nprivacy guarantees, but often lead to a significant reduction in utility. On\nthe other hand, context-aware privacy solutions, such as information theoretic\nprivacy, achieve an improved privacy-utility tradeoff, but assume that the data\nholder has access to dataset statistics. We circumvent these limitations by\nintroducing a novel context-aware privacy framework called generative\nadversarial privacy (GAP). GAP leverages recent advancements in generative\nadversarial networks (GANs) to allow the data holder to learn privatization\nschemes from the dataset itself. Under GAP, learning the privacy mechanism is\nformulated as a constrained minimax game between two players: a privatizer that\nsanitizes the dataset in a way that limits the risk of inference attacks on the\nindividuals' private variables, and an adversary that tries to infer the\nprivate variables from the sanitized dataset. To evaluate GAP's performance, we\ninvestigate two simple (yet canonical) statistical dataset models: (a) the\nbinary data model, and (b) the binary Gaussian mixture model. For both models,\nwe derive game-theoretically optimal minimax privacy mechanisms, and show that\nthe privacy mechanisms learned from data (in a generative adversarial fashion)\nmatch the theoretically optimal ones. This demonstrates that our framework can\nbe easily applied in practice, even in the absence of dataset statistics.\n","authors":"Chong Huang|Peter Kairouz|Xiao Chen|Lalitha Sankar|Ram Rajagopal","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.09549v3","link_pdf":"http://arxiv.org/pdf/1710.09549v3","link_doi":"http://dx.doi.org/10.3390/e19120656","comment":"Improved version of a paper accepted by Entropy Journal, Special\n Issue on Information Theory in Machine Learning and Data Science","journal_ref":"","doi":"10.3390/e19120656","primary_category":"cs.LG","categories":"cs.LG|cs.AI|cs.CR|cs.GT|cs.IT|math.IT"} {"id":"1710.09787v2","submitted":"2017-10-26 16:17:36","updated":"2017-11-18 21:07:43","title":"Optimal Shrinkage of Singular Values Under Random Data Contamination","abstract":" A low rank matrix X has been contaminated by uniformly distributed noise,\nmissing values, outliers and corrupt entries. Reconstruction of X from the\nsingular values and singular vectors of the contaminated matrix Y is a key\nproblem in machine learning, computer vision and data science. In this paper we\nshow that common contamination models (including arbitrary combinations of\nuniform noise,missing values, outliers and corrupt entries) can be described\nefficiently using a single framework. 
We develop an asymptotically optimal\nalgorithm that estimates X by manipulation of the singular values of Y , which\napplies to any of the contamination models considered. Finally, we find an\nexplicit signal-to-noise cutoff, below which estimation of X from the singular\nvalue decomposition of Y must fail, in a well-defined sense.\n","authors":"Danny Barash|Matan Gavish","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.09787v2","link_pdf":"http://arxiv.org/pdf/1710.09787v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.IT","categories":"cs.IT|cs.LG|math.IT|stat.ML"} {"id":"1710.10355v2","submitted":"2017-10-27 23:53:13","updated":"2018-02-23 20:00:07","title":"Convolutional Neural Networks Via Node-Varying Graph Filters","abstract":" Convolutional neural networks (CNNs) are being applied to an increasing\nnumber of problems and fields due to their superior performance in\nclassification and regression tasks. Since two of the key operations that CNNs\nimplement are convolution and pooling, this type of networks is implicitly\ndesigned to act on data described by regular structures such as images.\nMotivated by the recent interest in processing signals defined in irregular\ndomains, we advocate a CNN architecture that operates on signals supported on\ngraphs. The proposed design replaces the classical convolution not with a\nnode-invariant graph filter (GF), which is the natural generalization of\nconvolution to graph domains, but with a node-varying GF. This filter extracts\ndifferent local features without increasing the output dimension of each layer\nand, as a result, bypasses the need for a pooling stage while involving only\nlocal operations. A second contribution is to replace the node-varying GF with\na hybrid node-varying GF, which is a new type of GF introduced in this paper.\nWhile the alternative architecture can still be run locally without requiring a\npooling stage, the number of trainable parameters is smaller and can be\nrendered independent of the data dimension. Tests are run on a synthetic source\nlocalization problem and on the 20NEWS dataset.\n","authors":"Fernando Gama|Geert Leus|Antonio G. Marques|Alejandro Ribeiro","affiliations":"","link_abstract":"http://arxiv.org/abs/1710.10355v2","link_pdf":"http://arxiv.org/pdf/1710.10355v2","link_doi":"","comment":"Submitted to DSW 2018 (IEEE Data Science Workshop)","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.NE"} {"id":"1710.11039v1","submitted":"2017-10-30 16:13:57","updated":"2017-10-30 16:13:57","title":"Data Science: A Powerful Catalyst for Cross-Sector Collaborations to\n Transform the Future of Global Health - Developing a New Interactive\n Relational Mapping Tool","abstract":" The increasingly complex and rapidly changing global health and\nsocio-economic landscape requires fundamentally new ways of thinking, acting\nand collaborating to solve growing systems challenges. Cross-sectoral\ncollaborations between governments, businesses, international organizations,\nprivate investors, academia and non-profits are essential for lasting success\nin achieving the Sustainable Development Goals (SDGs), and securing a\nprosperous future for the health and wellbeing of all people. 
Our aim is to use\ndata science and innovative technologies to map diverse stakeholders and their\ninitiatives around SDGs and specific health targets - with particular focus on\nSDG 3 (Good Health & Well Being) and SDG 17 (Partnerships for the Goals) - to\naccelerate cross-sector collaborations. Initially, the mapping tool focuses on\nGeneva, Switzerland as the world center of global health diplomacy with over 80\nkey stakeholders and influencers present. As we develop the next level pilot,\nwe aim to build on users' interests, with a potential focus on non-communicable\ndiseases (NCDs) as one of the emerging and most pressing global health issues\nthat requires new collaborative approaches. Building on this pilot, we can\nlater expand beyond only SDG 3 to other SDGs.\n","authors":"Barbara Bulc|Cassie Landers|Katherine Driscoll","affiliations":"Global Development|Columbia University|Columbia University","link_abstract":"http://arxiv.org/abs/1710.11039v1","link_pdf":"http://arxiv.org/pdf/1710.11039v1","link_doi":"","comment":"Presented at the Data For Good Exchange 2017","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1711.00028v1","submitted":"2017-10-31 18:07:10","updated":"2017-10-31 18:07:10","title":"Hack Weeks as a model for Data Science Education and Collaboration","abstract":" Across almost all scientific disciplines, the instruments that record our\nexperimental data and the methods required for storage and data analysis are\nrapidly increasing in complexity. This gives rise to the need for scientific\ncommunities to adapt on shorter time scales than traditional university\ncurricula allow for, and therefore requires new modes of knowledge transfer.\nThe universal applicability of data science tools to a broad range of problems\nhas generated new opportunities to foster exchange of ideas and computational\nworkflows across disciplines. In recent years, hack weeks have emerged as an\neffective tool for fostering these exchanges by providing training in modern\ndata analysis workflows. While there are variations in hack week\nimplementation, all events consist of a common core of three components:\ntutorials in state-of-the-art methodology, peer-learning and project work in a\ncollaborative environment. In this paper, we present the concept of a hack week\nin the larger context of scientific meetings and point out similarities and\ndifferences to traditional conferences. We motivate the need for such an event\nand present in detail its strengths and challenges. We find that hack weeks are\nsuccessful at cultivating collaboration and the exchange of knowledge.\nParticipants self-report that these events help them both in their day-to-day\nresearch as well as their careers. Based on our results, we conclude that hack\nweeks present an effective, easy-to-implement, fairly low-cost tool to\npositively impact data analysis literacy in academic disciplines, foster\ncollaboration and cultivate best practices.\n","authors":"Daniela Huppenkothen|Anthony Arendt|David W. 
Hogg|Karthik Ram|Jake VanderPlas|Ariel Rokem","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.00028v1","link_pdf":"http://arxiv.org/pdf/1711.00028v1","link_doi":"http://dx.doi.org/10.1073/pnas.1717196115","comment":"15 pages, 2 figures, submitted to PNAS, all relevant code available\n at https://github.com/uwescience/HackWeek-Writeup","journal_ref":"","doi":"10.1073/pnas.1717196115","primary_category":"physics.ed-ph","categories":"physics.ed-ph|astro-ph.IM|cs.CY"} {"id":"1711.00487v1","submitted":"2017-11-01 18:03:07","updated":"2017-11-01 18:03:07","title":"Tensor Valued Common and Individual Feature Extraction:\n Multi-dimensional Perspective","abstract":" A novel method for common and individual feature analysis from exceedingly\nlarge-scale data is proposed, in order to ensure the tractability of both the\ncomputation and storage and thus mitigate the curse of dimensionality, a major\nbottleneck in modern data science. This is achieved by making use of the\ninherent redundancy in so-called multi-block data structures, which represent\nmultiple observations of the same phenomenon taken at different times, angles\nor recording conditions. Upon providing an intrinsic link between the\nproperties of the outer vector product and extracted features in tensor\ndecompositions (TDs), the proposed common and individual information extraction\nfrom multi-block data is performed through imposing physical meaning to\notherwise unconstrained factorisation approaches. This is shown to dramatically\nreduce the dimensionality of search spaces for subsequent classification\nprocedures and to yield greatly enhanced accuracy. Simulations on a multi-class\nclassification task of large-scale extraction of individual features from a\ncollection of partially related real-world images demonstrate the advantages of\nthe \"blessing of dimensionality\" associated with TDs.\n","authors":"Ilia Kisil|Giuseppe G. Calvi|Danilo P. Mandic","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.00487v1","link_pdf":"http://arxiv.org/pdf/1711.00487v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"eess.SP","categories":"eess.SP|stat.ML"} {"id":"1711.00817v3","submitted":"2017-11-02 17:00:05","updated":"2017-11-07 07:15:42","title":"Medoids in almost linear time via multi-armed bandits","abstract":" Computing the medoid of a large number of points in high-dimensional space is\nan increasingly common operation in many data science problems. We present an\nalgorithm Med-dit which uses O(n log n) distance evaluations to compute the\nmedoid with high probability. Med-dit is based on a connection with the\nmulti-armed bandit problem. We evaluate the performance of Med-dit empirically\non the Netflix-prize and the single-cell RNA-Seq datasets, containing hundreds\nof thousands of points living in tens of thousands of dimensions, and observe a\n5-10x improvement in performance over the current state of the art. Med-dit is\navailable at https://github.com/bagavi/Meddit\n","authors":"Vivek Bagaria|Govinda M. Kamath|Vasilis Ntranos|Martin J. 
Zhang|David Tse","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.00817v3","link_pdf":"http://arxiv.org/pdf/1711.00817v3","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.DS|cs.IT|cs.LG|math.IT"} {"id":"1711.01835v1","submitted":"2017-11-06 11:18:55","updated":"2017-11-06 11:18:55","title":"Asymptotics for high-dimensional covariance matrices and quadratic forms\n with applications to the trace functional and shrinkage","abstract":" We establish large sample approximations for an arbitray number of bilinear\nforms of the sample variance-covariance matrix of a high-dimensional vector\ntime series using $ \\ell_1$-bounded and small $\\ell_2$-bounded weighting\nvectors. Estimation of the asymptotic covariance structure is also discussed.\nThe results hold true without any constraint on the dimension, the number of\nforms and the sample size or their ratios. Concrete and potential applications\nare widespread and cover high-dimensional data science problems such as tests\nfor large numbers of covariances, sparse portfolio optimization and projections\nonto sparse principal components or more general spanning sets as frequently\nconsidered, e.g. in classification and dictionary learning. As two specific\napplications of our results, we study in greater detail the asymptotics of the\ntrace functional and shrinkage estimation of covariance matrices. In shrinkage\nestimation, it turns out that the asymptotics differs for weighting vectors\nbounded away from orthogonaliy and nearly orthogonal ones in the sense that\ntheir inner product converges to 0.\n","authors":"Ansgar Steland|Rainer von Sachs","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.01835v1","link_pdf":"http://arxiv.org/pdf/1711.01835v1","link_doi":"","comment":"42 pages","journal_ref":"","doi":"","primary_category":"math.PR","categories":"math.PR|math.ST|stat.TH|60F17, 62E20"} {"id":"1711.03091v4","submitted":"2017-11-08 18:50:49","updated":"2018-10-22 17:27:50","title":"Dispersion for Data-Driven Algorithm Design, Online Learning, and\n Private Optimization","abstract":" Data-driven algorithm design, that is, choosing the best algorithm for a\nspecific application, is a crucial problem in modern data science.\nPractitioners often optimize over a parameterized algorithm family, tuning\nparameters based on problems from their domain. These procedures have\nhistorically come with no guarantees, though a recent line of work studies\nalgorithm selection from a theoretical perspective. We advance the foundations\nof this field in several directions: we analyze online algorithm selection,\nwhere problems arrive one-by-one and the goal is to minimize regret, and\nprivate algorithm selection, where the goal is to find good parameters over a\nset of problems without revealing sensitive information contained therein. We\nstudy important algorithm families, including SDP-rounding schemes for problems\nformulated as integer quadratic programs, and greedy techniques for canonical\nsubset selection problems. In these cases, the algorithm's performance is a\nvolatile and piecewise Lipschitz function of its parameters, since tweaking the\nparameters can completely change the algorithm's behavior. We give a sufficient\nand general condition, dispersion, defining a family of piecewise Lipschitz\nfunctions that can be optimized online and privately, which includes the\nfunctions measuring the performance of the algorithms we study. 
Intuitively, a\nset of piecewise Lipschitz functions is dispersed if no small region contains\nmany of the functions' discontinuities. We present general techniques for\nonline and private optimization of the sum of dispersed piecewise Lipschitz\nfunctions. We improve over the best-known regret bounds for a variety of\nproblems, prove regret bounds for problems not previously studied, and give\nmatching lower bounds. We also give matching upper and lower bounds on the\nutility loss due to privacy. Moreover, we uncover dispersion in auction design\nand pricing problems.\n","authors":"Maria-Florina Balcan|Travis Dick|Ellen Vitercik","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.03091v4","link_pdf":"http://arxiv.org/pdf/1711.03091v4","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG"} {"id":"1711.03219v1","submitted":"2017-11-09 00:32:01","updated":"2017-11-09 00:32:01","title":"Denotational validation of higher-order Bayesian inference","abstract":" We present a modular semantic account of Bayesian inference algorithms for\nprobabilistic programming languages, as used in data science and machine\nlearning. Sophisticated inference algorithms are often explained in terms of\ncomposition of smaller parts. However, neither their theoretical justification\nnor their implementation reflects this modularity. We show how to conceptualise\nand analyse such inference algorithms as manipulating intermediate\nrepresentations of probabilistic programs using higher-order functions and\ninductive types, and their denotational semantics. Semantic accounts of\ncontinuous distributions use measurable spaces. However, our use of\nhigher-order functions presents a substantial technical difficulty: it is\nimpossible to define a measurable space structure over the collection of\nmeasurable functions between arbitrary measurable spaces that is compatible\nwith standard operations on those functions, such as function application. We\novercome this difficulty using quasi-Borel spaces, a recently proposed\nmathematical structure that supports both function spaces and continuous\ndistributions. We define a class of semantic structures for representing\nprobabilistic programs, and semantic validity criteria for transformations of\nthese representations in terms of distribution preservation. We develop a\ncollection of building blocks for composing representations. We use these\nbuilding blocks to validate common inference algorithms such as Sequential\nMonte Carlo and Markov Chain Monte Carlo. To emphasize the connection between\nthe semantic manipulation and its traditional measure theoretic origins, we use\nKock's synthetic measure theory. We demonstrate its usefulness by proving a\nquasi-Borel counterpart to the Metropolis-Hastings-Green theorem.\n","authors":"Adam Ścibior|Ohad Kammar|Matthijs Vákár|Sam Staton|Hongseok Yang|Yufei Cai|Klaus Ostermann|Sean K. Moss|Chris Heunen|Zoubin Ghahramani","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.03219v1","link_pdf":"http://arxiv.org/pdf/1711.03219v1","link_doi":"http://dx.doi.org/10.1145/3158148","comment":"","journal_ref":"Proc. ACM Program. Lang. 
2, POPL, Article 60 (January 2018)","doi":"10.1145/3158148","primary_category":"cs.PL","categories":"cs.PL"} {"id":"1711.04126v4","submitted":"2017-11-11 12:32:01","updated":"2018-05-22 02:41:02","title":"Adversarial Training for Disease Prediction from Electronic Health\n Records with Missing Data","abstract":" Electronic health records (EHRs) have contributed to the computerization of\npatient records and can thus be used not only for efficient and systematic\nmedical services, but also for research on biomedical data science. However,\nthere are many missing values in EHRs when provided in matrix form, which is an\nimportant issue in many biomedical EHR applications. In this paper, we propose\na two-stage framework that includes missing data imputation and disease\nprediction to address the missing data problem in EHRs. We compared the disease\nprediction performance of generative adversarial networks (GANs) and\nconventional learning algorithms in combination with missing data prediction\nmethods. As a result, we obtained a level of accuracy of 0.9777, sensitivity of\n0.9521, specificity of 0.9925, area under the receiver operating characteristic\ncurve (AUC-ROC) of 0.9889, and F-score of 0.9688 with a stacked autoencoder as\nthe missing data prediction method and an auxiliary classifier GAN (AC-GAN) as\nthe disease prediction method. The comparison results show that a combination\nof a stacked autoencoder and an AC-GAN significantly outperforms other existing\napproaches. Our results suggest that the proposed framework is more robust for\ndisease prediction from EHRs with missing data.\n","authors":"Uiwon Hwang|Sungwoon Choi|Han-Byoel Lee|Sungroh Yoon","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.04126v4","link_pdf":"http://arxiv.org/pdf/1711.04126v4","link_doi":"","comment":"10 pages, 4 figures","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1711.04495v1","submitted":"2017-11-13 10:02:42","updated":"2017-11-13 10:02:42","title":"Geo-spatial Monitoring Of Infectious Diseases By Unmanned Aerial\n Vehicles","abstract":" Recent developments in UAV technology have paved the way for numerous\napplications in diverse cross-disciplinary fields. One of the main features of\nUAVs is their portability in terms of size, which allows them to navigate through\nfairly hostile environments and collect data. This data collection leads to\nthe interpretation of behavior and predictability through the analysis\nmethods of data science. The application of UAVs to monitor population\nand climate geography is well documented, but the use of UAVs to study\ngerms in the atmosphere is poorly documented or absent. As air remains one of the\nmain media of transmission of germs, there must be some kind of signature\nspecific to a particular kind of germ. Using this as a cue, the present\ncommunication presents a hypothetical model to study the spread of disease.\nThis model can help epidemiologists understand the mechanisms of\nmicrobial traffic (for example, flu transferred within the same\nspecies or across species), spatial diffusion (for example, human travel\npatterns) and newly recognized diseases, for example various types of flu and\nvector-borne diseases such as malaria and dengue. 
This model also covers some\nrelevant scenarios like global climate change, political ecologic emergences of\naerial transmitted diseases.\n","authors":"Chiranjib Patra","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.04495v1","link_pdf":"http://arxiv.org/pdf/1711.04495v1","link_doi":"","comment":"This paper was presented at GeoMundus 2017\n (http://www.geomundus.org/2017/) and was one of the winners of Travel Grant\n for the presentation of Abstract at the Institute for GeoInformatics ,\n Munster, Germany","journal_ref":"","doi":"","primary_category":"q-bio.PE","categories":"q-bio.PE"} {"id":"1711.04712v1","submitted":"2017-11-13 17:22:00","updated":"2017-11-13 17:22:00","title":"Randomized Near Neighbor Graphs, Giant Components, and Applications in\n Data Science","abstract":" If we pick $n$ random points uniformly in $[0,1]^d$ and connect each point to\nits $k-$nearest neighbors, then it is well known that there exists a giant\nconnected component with high probability. We prove that in $[0,1]^d$ it\nsuffices to connect every point to $ c_{d,1} \\log{\\log{n}}$ points chosen\nrandomly among its $ c_{d,2} \\log{n}-$nearest neighbors to ensure a giant\ncomponent of size $n - o(n)$ with high probability. This construction yields a\nmuch sparser random graph with $\\sim n \\log\\log{n}$ instead of $\\sim n \\log{n}$\nedges that has comparable connectivity properties. This result has nontrivial\nimplications for problems in data science where an affinity matrix is\nconstructed: instead of picking the $k-$nearest neighbors, one can often pick\n$k' \\ll k$ random points out of the $k-$nearest neighbors without sacrificing\nefficiency. This can massively simplify and accelerate computation, we\nillustrate this with several numerical examples.\n","authors":"George C. Linderman|Gal Mishne|Yuval Kluger|Stefan Steinerberger","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.04712v1","link_pdf":"http://arxiv.org/pdf/1711.04712v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.CO","categories":"math.CO|cs.DM|cs.DS|math.PR|stat.ML"} {"id":"1711.05123v5","submitted":"2017-11-14 14:46:46","updated":"2020-02-28 17:30:31","title":"Preconditioned proximal point methods and notions of partial\n subregularity","abstract":" Based on the needs of convergence proofs of preconditioned proximal point\nmethods, we introduce notions of partial strong submonotonicity and partial\n(metric) subregularity of set-valued maps. We study relationships between these\ntwo concepts, neither of which is generally weaker or stronger than the other\none. For our algorithmic purposes, the novel submonotonicity turns out to be\neasier to employ than more conventional error bounds obtained from\nsubregularity. Using strong submonotonicity, we demonstrate the linear\nconvergence of the Primal-Dual Proximal splitting method to some strictly\ncomplementary solutions of example problems from image processing and data\nscience. 
This is without the conventional assumption that all the objective\nfunctions of the involved saddle point problem are strongly convex.\n","authors":"Tuomo Valkonen","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.05123v5","link_pdf":"http://arxiv.org/pdf/1711.05123v5","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.OC","categories":"math.OC"} {"id":"1711.05887v1","submitted":"2017-11-16 01:58:20","updated":"2017-11-16 01:58:20","title":"On Analyzing Job Hop Behavior and Talent Flow Networks","abstract":" Analyzing job hopping behavior is important for the understanding of job\npreference and career progression of working individuals. When analyzed at the\nworkforce population level, job hop analysis helps to gain insights of talent\nflow and organization competition. Traditionally, surveys are conducted on job\nseekers and employers to study job behavior. While surveys are good at getting\ndirect user input to specially designed questions, they are often not scalable\nand timely enough to cope with fast-changing job landscape. In this paper, we\npresent a data science approach to analyze job hops performed by about 490,000\nworking professionals located in a city using their publicly shared profiles.\nWe develop several metrics to measure how much work experience is needed to\ntake up a job and how recent/established the job is, and then examine how these\nmetrics correlate with the propensity of hopping. We also study how job hop\nbehavior is related to job promotion/demotion. Finally, we perform network\nanalyses at the job and organization levels in order to derive insights on\ntalent flow as well as job and organizational competitiveness.\n","authors":"Richard J. Oentaryo|Xavier Jayaraj Siddarth Ashok|Ee-Peng Lim|Philips Kokoh Prasetyo","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.05887v1","link_pdf":"http://arxiv.org/pdf/1711.05887v1","link_doi":"http://dx.doi.org/10.1109/ICDMW.2017.172","comment":"","journal_ref":"ICDM Data Science for Human Capital Management 2017","doi":"10.1109/ICDMW.2017.172","primary_category":"cs.SI","categories":"cs.SI|stat.AP"} {"id":"1711.06538v1","submitted":"2017-11-17 13:58:44","updated":"2017-11-17 13:58:44","title":"Discovery of Complex Anomalous Patterns of Sexual Violence in El\n Salvador","abstract":" When sexual violence is a product of organized crime or social imaginary, the\nlinks between sexual violence episodes can be understood as a latent structure.\nWith this assumption in place, we can use data science to uncover complex\npatterns. In this paper we focus on the use of data mining techniques to unveil\ncomplex anomalous spatiotemporal patterns of sexual violence. We illustrate\ntheir use by analyzing all reported rapes in El Salvador over a period of nine\nyears. Through our analysis, we are able to provide evidence of phenomena that,\nto the best of our knowledge, have not been previously reported in literature.\nWe devote special attention to a pattern we discover in the East, where\nunderage victims report their boyfriends as perpetrators at anomalously high\nrates. 
Finally, we explain how such analyses could be conducted in real-time,\nenabling early detection of emerging patterns to allow law enforcement agencies\nand policy makers to react accordingly.\n","authors":"Maria De-Arteaga|Artur Dubrawski","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.06538v1","link_pdf":"http://arxiv.org/pdf/1711.06538v1","link_doi":"http://dx.doi.org/10.5281/zenodo.571551","comment":"Conference paper at Data for Policy 2016 - Frontiers of Data Science\n for Government: Ideas, Practices and Projections (Data for Policy)","journal_ref":"","doi":"10.5281/zenodo.571551","primary_category":"stat.AP","categories":"stat.AP"} {"id":"1711.07580v1","submitted":"2017-11-20 23:42:34","updated":"2017-11-20 23:42:34","title":"Science Driven Innovations Powering Mobile Product: Cloud AI vs. Device\n AI Solutions on Smart Device","abstract":" Recent years have witnessed the increasing popularity of mobile devices (such\nas the iPhone) due to the convenience they bring to human lives. On one hand,\nrich user profiling and behavior data (including per-app level, app-interaction\nlevel and system-interaction level) from heterogeneous information sources make\nit possible to provide much better services (such as recommendation,\nadvertisement targeting) to customers, which further drives revenue from\nunderstanding users' behaviors and improving users' engagement. In order to\ndelight the customers, intelligent personal assistants (such as Amazon Alexa,\nGoogle Home and Google Now) are highly desirable to provide real-time audio,\nvideo and image recognition, natural language understanding, comfortable user\ninteraction interface, satisfactory recommendation and effective advertisement\ntargeting.\n This paper presents the research efforts we have conducted on mobile devices\nwhich aim to provide much smarter and more convenient services by leveraging\nstatistics and big data science, machine learning and deep learning, user\nmodeling and marketing techniques to bring in significant user growth, user\nengagement and satisfaction (and happiness) on mobile devices. The developed\nnew features are built at either cloud side or device side, harmoniously\nworking together to enhance the current service with the purpose of increasing\nusers' happiness. We illustrate how we design these new features from a system\nand algorithm perspective using different case studies, through which one can\neasily understand how science-driven innovations help to provide much better\nservice in technology and bring more revenue lift in business. In the\nmeantime, these research efforts have made clear scientific contributions and\nhave been published in top venues, and they are playing more and more important roles for\nmobile AI products.\n","authors":"Deguang Kong","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.07580v1","link_pdf":"http://arxiv.org/pdf/1711.07580v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1711.08324v1","submitted":"2017-11-22 15:14:10","updated":"2017-11-22 15:14:10","title":"Evaluate the Malignancy of Pulmonary Nodules Using the 3D Deep Leaky\n Noisy-or Network","abstract":" Automatically diagnosing lung cancer from Computed Tomography (CT) scans involves\ntwo steps: detecting all suspicious lesions (pulmonary nodules) and evaluating the\nwhole-lung/pulmonary malignancy. Currently, there are many studies about the\nfirst step, but few about the second step. 
Since the existence of nodule does\nnot definitely indicate cancer, and the morphology of nodule has a complicated\nrelationship with cancer, the diagnosis of lung cancer demands careful\ninvestigations on every suspicious nodule and integration of information of all\nnodules. We propose a 3D deep neural network to solve this problem. The model\nconsists of two modules. The first one is a 3D region proposal network for\nnodule detection, which outputs all suspicious nodules for a subject. The\nsecond one selects the top five nodules based on the detection confidence,\nevaluates their cancer probabilities and combines them with a leaky noisy-or\ngate to obtain the probability of lung cancer for the subject. The two modules\nshare the same backbone network, a modified U-net. The over-fitting caused by\nthe shortage of training data is alleviated by training the two modules\nalternately. The proposed model won the first place in the Data Science Bowl\n2017 competition. The code has been made publicly available.\n","authors":"Fangzhou Liao|Ming Liang|Zhe Li|Xiaolin Hu|Sen Song","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.08324v1","link_pdf":"http://arxiv.org/pdf/1711.08324v1","link_doi":"http://dx.doi.org/10.1109/TNNLS.2019.2892409","comment":"12 pages, 9 figures","journal_ref":"","doi":"10.1109/TNNLS.2019.2892409","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1711.09279v1","submitted":"2017-11-25 20:11:41","updated":"2017-11-25 20:11:41","title":"A Big Data Analysis Framework Using Apache Spark and Deep Learning","abstract":" With the spreading prevalence of Big Data, many advances have recently been\nmade in this field. Frameworks such as Apache Hadoop and Apache Spark have\ngained a lot of traction over the past decades and have become massively\npopular, especially in industries. It is becoming increasingly evident that\neffective big data analysis is key to solving artificial intelligence problems.\nThus, a multi-algorithm library was implemented in the Spark framework, called\nMLlib. While this library supports multiple machine learning algorithms, there\nis still scope to use the Spark setup efficiently for highly time-intensive and\ncomputationally expensive procedures like deep learning. In this paper, we\npropose a novel framework that combines the distributive computational\nabilities of Apache Spark and the advanced machine learning architecture of a\ndeep multi-layer perceptron (MLP), using the popular concept of Cascade\nLearning. We conduct empirical analysis of our framework on two real world\ndatasets. 
The results are encouraging and corroborate our proposed framework,\nin turn proving that it is an improvement over traditional big data analysis\nmethods that use either Spark or Deep learning as individual elements.\n","authors":"Anand Gupta|Hardeo Thakur|Ritvik Shrivastava|Pulkit Kumar|Sreyashi Nag","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.09279v1","link_pdf":"http://arxiv.org/pdf/1711.09279v1","link_doi":"","comment":"To be published in IEEE ICDM 2017 (International Conference on Data\n Mining) Workshop on Data Science and Big Data Analytics (DSBDA)","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB|cs.LG|stat.ML"} {"id":"1711.09726v3","submitted":"2017-11-27 14:56:38","updated":"2018-01-31 13:59:45","title":"Exploiting the potential of unlabeled endoscopic video data with\n self-supervised learning","abstract":" Surgical data science is a new research field that aims to observe all\naspects of the patient treatment process in order to provide the right\nassistance at the right time. Due to the breakthrough successes of deep\nlearning-based solutions for automatic image annotation, the availability of\nreference annotations for algorithm training is becoming a major bottleneck in\nthe field. The purpose of this paper was to investigate the concept of\nself-supervised learning to address this issue.\n Our approach is guided by the hypothesis that unlabeled video data can be\nused to learn a representation of the target domain that boosts the performance\nof state-of-the-art machine learning algorithms when used for pre-training.\nCore of the method is an auxiliary task based on raw endoscopic video data of\nthe target domain that is used to initialize the convolutional neural network\n(CNN) for the target task. In this paper, we propose the re-colorization of\nmedical images with a generative adversarial network (GAN)-based architecture\nas auxiliary task. A variant of the method involves a second pre-training step\nbased on labeled data for the target task from a related domain. We validate\nboth variants using medical instrument segmentation as target task.\n The proposed approach can be used to radically reduce the manual annotation\neffort involved in training CNNs. Compared to the baseline approach of\ngenerating annotated data from scratch, our method decreases exploratively the\nnumber of labeled images by up to 75% without sacrificing performance. 
Our\nmethod also outperforms alternative methods for CNN pre-training, such as\npre-training on publicly available non-medical or medical data using the target\ntask (in this instance: segmentation).\n As it makes efficient use of available (non-)public and (un-)labeled data,\nthe approach has the potential to become a valuable tool for CNN\n(pre-)training.\n","authors":"Tobias Ross|David Zimmerer|Anant Vemuri|Fabian Isensee|Manuel Wiesenfarth|Sebastian Bodenstedt|Fabian Both|Philip Kessler|Martin Wagner|Beat Müller|Hannes Kenngott|Stefanie Speidel|Annette Kopp-Schneider|Klaus Maier-Hein|Lena Maier-Hein","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.09726v3","link_pdf":"http://arxiv.org/pdf/1711.09726v3","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1711.10292v1","submitted":"2017-11-28 13:54:56","updated":"2017-11-28 13:54:56","title":"Providing theoretical learning guarantees to Deep Learning Networks","abstract":" Deep Learning (DL) is one of the most common subjects when Machine Learning\nand Data Science approaches are considered. There are clearly two movements\nrelated to DL: the first aggregates researchers in quest to outperform other\nalgorithms from literature, trying to win contests by considering often small\ndecreases in the empirical risk; and the second investigates overfitting\nevidences, questioning the learning capabilities of DL classifiers. Motivated\nby such opposed points of view, this paper employs the Statistical Learning\nTheory (SLT) to study the convergence of Deep Neural Networks, with particular\ninterest in Convolutional Neural Networks. In order to draw theoretical\nconclusions, we propose an approach to estimate the Shattering coefficient of\nthose classification algorithms, providing a lower bound for the complexity of\ntheir space of admissible functions, a.k.a. algorithm bias. Based on such\nestimator, we generalize the complexity of network biases, and, next, we study\nAlexNet and VGG16 architectures in the point of view of their Shattering\ncoefficients, and number of training examples required to provide theoretical\nlearning guarantees. From our theoretical formulation, we show the conditions\nwhich Deep Neural Networks learn as well as point out another issue: DL\nbenchmarks may be strictly driven by empirical risks, disregarding the\ncomplexity of algorithms biases.\n","authors":"Rodrigo Fernandes de Mello|Martha Dais Ferreira|Moacir Antonelli Ponti","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.10292v1","link_pdf":"http://arxiv.org/pdf/1711.10292v1","link_doi":"","comment":"Submitted to JMLR","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG"} {"id":"1711.10558v1","submitted":"2017-11-28 21:58:52","updated":"2017-11-28 21:58:52","title":"Intent-Aware Contextual Recommendation System","abstract":" Recommender systems take inputs from user history, use an internal ranking\nalgorithm to generate results and possibly optimize this ranking based on\nfeedback. However, often the recommender system is unaware of the actual intent\nof the user and simply provides recommendations dynamically without properly\nunderstanding the thought process of the user. An intelligent recommender\nsystem is not only useful for the user but also for businesses which want to\nlearn the tendencies of their users. 
Finding out tendencies or intents of a\nuser is a difficult problem to solve.\n Keeping this in mind, we sought out to create an intelligent system which\nwill keep track of the user's activity on a web-application as well as\ndetermine the intent of the user in each session. We devised a way to encode\nthe user's activity through the sessions. Then, we have represented the\ninformation seen by the user in a high dimensional format which is reduced to\nlower dimensions using tensor factorization techniques. The aspect of intent\nawareness (or scoring) is dealt with at this stage. Finally, combining the user\nactivity data with the contextual information gives the recommendation score.\nThe final recommendations are then ranked using filtering and collaborative\nrecommendation techniques to show the top-k recommendations to the user. A\nprovision for feedback is also envisioned in the current system which informs\nthe model to update the various weights in the recommender system. Our overall\nmodel aims to combine both frequency-based and context-based recommendation\nsystems and quantify the intent of a user to provide better recommendations.\n We ran experiments on real-world timestamped user activity data, in the\nsetting of recommending reports to the users of a business analytics tool and\nthe results are better than the baselines. We also tuned certain aspects of our\nmodel to arrive at optimized results.\n","authors":"Biswarup Bhattacharya|Iftikhar Burhanuddin|Abhilasha Sancheti|Kushal Satya","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.10558v1","link_pdf":"http://arxiv.org/pdf/1711.10558v1","link_doi":"http://dx.doi.org/10.1109/ICDMW.2017.8","comment":"Presented at the 5th International Workshop on Data Science and Big\n Data Analytics (DSBDA), 17th IEEE International Conference on Data Mining\n (ICDM) 2017; 8 pages; 4 figures; Due to the limitation \"The abstract field\n cannot be longer than 1,920 characters,\" the abstract appearing here is\n slightly shorter than the one in the PDF file","journal_ref":"","doi":"10.1109/ICDMW.2017.8","primary_category":"cs.IR","categories":"cs.IR|cs.AI|cs.LG|stat.ML"} {"id":"1711.10609v1","submitted":"2017-11-28 23:17:53","updated":"2017-11-28 23:17:53","title":"A recurrent neural network for classification of unevenly sampled\n variable stars","abstract":" Astronomical surveys of celestial sources produce streams of noisy time\nseries measuring flux versus time (\"light curves\"). Unlike in many other\nphysical domains, however, large (and source-specific) temporal gaps in data\narise naturally due to intranight cadence choices as well as diurnal and\nseasonal constraints. With nightly observations of millions of variable stars\nand transients from upcoming surveys, efficient and accurate discovery and\nclassification techniques on noisy, irregularly sampled data must be employed\nwith minimal human-in-the-loop involvement. Machine learning for inference\ntasks on such data traditionally requires the laborious hand-coding of\ndomain-specific numerical summaries of raw data (\"features\"). Here we present a\nnovel unsupervised autoencoding recurrent neural network (RNN) that makes\nexplicit use of sampling times and known heteroskedastic noise properties. When\ntrained on optical variable star catalogs, this network produces supervised\nclassification models that rival other best-in-class approaches. We find that\nautoencoded features learned on one time-domain survey perform nearly as well\nwhen applied to another survey. 
These networks can continue to learn from new\nunlabeled observations and may be used in other unsupervised tasks such as\nforecasting and anomaly detection.\n","authors":"Brett Naul|Joshua S. Bloom|Fernando Pérez|Stéfan van der Walt","affiliations":"Department of Astronomy, University of California, Berkeley, CA, USA|Department of Astronomy, University of California, Berkeley, CA, USA|Department of Statistics, University of California, Berkeley, CA, USA|Berkeley Institute for Data Science, University of California, Berkeley, CA, USA","link_abstract":"http://arxiv.org/abs/1711.10609v1","link_pdf":"http://arxiv.org/pdf/1711.10609v1","link_doi":"http://dx.doi.org/10.1038/s41550-017-0321-z","comment":"23 pages, 14 figures. The published version is at Nature Astronomy\n (https://www.nature.com/articles/s41550-017-0321-z). Source code for models,\n experiments, and figures at\n https://github.com/bnaul/IrregularTimeSeriesAutoencoderPaper (Zenodo Code\n DOI: 10.5281/zenodo.1045560)","journal_ref":"","doi":"10.1038/s41550-017-0321-z","primary_category":"astro-ph.IM","categories":"astro-ph.IM|astro-ph.SR|physics.data-an"} {"id":"1712.00346v3","submitted":"2017-11-30 07:10:49","updated":"2018-03-13 12:17:23","title":"Bayes Minimax Competitors of Preliminary Test Estimators in k Sample\n Problems","abstract":" In this paper, we consider the estimation of a mean vector of a multivariate\nnormal population where the mean vector is suspected to be nearly equal to mean\nvectors of $k-1$ other populations. As an alternative to the preliminary test\nestimator based on the test statistic for testing hypothesis of equal means, we\nderive empirical and hierarchical Bayes estimators which shrink the sample mean\nvector toward a pooled mean estimator given under the hypothesis. The\nminimaxity of those Bayesian estimators are shown, and their performances are\ninvestigated by simulation.\n","authors":"Ryo Imai|Tatsuya Kubokawa|Malay Ghosh","affiliations":"","link_abstract":"http://arxiv.org/abs/1712.00346v3","link_pdf":"http://arxiv.org/pdf/1712.00346v3","link_doi":"http://dx.doi.org/10.1007/s42081-018-0002-x","comment":"16 pages. arXiv admin note: text overlap with arXiv:1711.10822","journal_ref":"Japanese Journal of Statistics and Data Science June 2018, Volume\n 1, Issue 1, pp 3-21","doi":"10.1007/s42081-018-0002-x","primary_category":"math.ST","categories":"math.ST|stat.TH"} {"id":"1711.11527v2","submitted":"2017-11-30 17:31:56","updated":"2017-12-14 14:37:02","title":"Improved Linear Embeddings via Lagrange Duality","abstract":" Near isometric orthogonal embeddings to lower dimensions are a fundamental\ntool in data science and machine learning. In this paper, we present the\nconstruction of such embeddings that minimizes the maximum distortion for a\ngiven set of points. We formulate the problem as a non convex constrained\noptimization problem. We first construct a primal relaxation and then use the\ntheory of Lagrange duality to create dual relaxation. We also suggest a\npolynomial time algorithm based on the theory of convex optimization to solve\nthe dual relaxation provably. We provide a theoretical upper bound on the\napproximation guarantees for our algorithm, which depends only on the spectral\nproperties of the dataset. 
We experimentally demonstrate the superiority of our\nalgorithm compared to baselines in terms of the scalability and the ability to\nachieve lower distortion.\n","authors":"Kshiteej Sheth|Dinesh Garg|Anirban Dasgupta","affiliations":"","link_abstract":"http://arxiv.org/abs/1711.11527v2","link_pdf":"http://arxiv.org/pdf/1711.11527v2","link_doi":"","comment":"20 pages","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1712.00544v2","submitted":"2017-12-02 04:13:17","updated":"2017-12-05 11:52:17","title":"Conducting Highly Principled Data Science: A Statistician's Job and Joy","abstract":" Highly Principled Data Science insists on methodologies that are: (1)\nscientifically justified, (2) statistically principled, and (3) computationally\nefficient. An astrostatistics collaboration, together with some reminiscences,\nillustrates the increased roles statisticians can and should play to ensure\nthis trio, and to advance the science of data along the way.\n","authors":"Xiao-Li Meng","affiliations":"","link_abstract":"http://arxiv.org/abs/1712.00544v2","link_pdf":"http://arxiv.org/pdf/1712.00544v2","link_doi":"","comment":"To appear in the special issue on \"The Role of Statistics in the Era\n of Big Data\" in Statistics and Probability Letters (2018)","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT|62P99"} {"id":"1712.00849v2","submitted":"2017-12-03 22:59:57","updated":"2017-12-14 18:50:40","title":"Comment: A brief survey of the current state of play for Bayesian\n computation in data science at Big-Data scale","abstract":" We wish to contribute to the discussion of \"Comparing Consensus Monte Carlo\nStrategies for Distributed Bayesian Computation\" by offering our views on the\ncurrent best methods for Bayesian computation, both at big-data scale and with\nsmaller data sets, as summarized in Table 1. This table is certainly an\nover-simplification of a highly complicated area of research in constant\n(present and likely future) flux, but we believe that constructing summaries of\nthis type is worthwhile despite their drawbacks, if only to facilitate further\ndiscussion.\n","authors":"David Draper|Alexander Terenin","affiliations":"","link_abstract":"http://arxiv.org/abs/1712.00849v2","link_pdf":"http://arxiv.org/pdf/1712.00849v2","link_doi":"","comment":"","journal_ref":"Brazilian Journal of Probability and Statistics 31(4):686-691,\n 2017","doi":"","primary_category":"stat.CO","categories":"stat.CO"} {"id":"1712.04221v1","submitted":"2017-12-12 10:46:27","updated":"2017-12-12 10:46:27","title":"Causal Patterns: Extraction of multiple causal relationships by Mixture\n of Probabilistic Partial Canonical Correlation Analysis","abstract":" In this paper, we propose a mixture of probabilistic partial canonical\ncorrelation analysis (MPPCCA) that extracts the Causal Patterns from two\nmultivariate time series. Causal patterns refer to the signal patterns within\ninteractions of two elements having multiple types of mutually causal\nrelationships, rather than a mixture of simultaneous correlations or the\nabsence of presence of a causal relationship between the elements. In\nmultivariate statistics, partial canonical correlation analysis (PCCA)\nevaluates the correlation between two multivariates after subtracting the\neffect of the third multivariate. 
PCCA can calculate the Granger Causality\nIndex (which tests whether a time-series can be predicted from another\ntime-series), but is not applicable to data containing multiple partial\ncanonical correlations. After introducing the MPPCCA, we propose an\nexpectation-maximization (EM) algorithm that estimates the parameters and latent\nvariables of the MPPCCA. The MPPCCA is expected to extract multiple partial\ncanonical correlations from data series without any supervised signals to split\nthe data as clusters. The method was then evaluated in synthetic data\nexperiments. In the synthetic dataset, our method estimated the multiple\npartial canonical correlations more accurately than the existing method. To\ndetermine the types of patterns detectable by the method, experiments were also\nconducted on real datasets. The method estimated the communication patterns in\nmotion-capture data. The MPPCCA is applicable to various types of signals such\nas brain signals, human communication and nonlinear complex multibody systems.\n","authors":"Hiroki Mori|Keisuke Kawano|Hiroki Yokoyama","affiliations":"","link_abstract":"http://arxiv.org/abs/1712.04221v1","link_pdf":"http://arxiv.org/pdf/1712.04221v1","link_doi":"http://dx.doi.org/10.1109/DSAA.2017.60","comment":"DSAA2017 - The 4th IEEE International Conference on Data Science and\n Advanced Analytics","journal_ref":"Proceedings of the 4th IEEE International Conference on Data\n Science and Advanced Analytics, pp.744-754, 2017","doi":"10.1109/DSAA.2017.60","primary_category":"stat.ME","categories":"stat.ME|stat.ML"} {"id":"1801.05854v1","submitted":"2017-12-15 02:36:07","updated":"2017-12-15 02:36:07","title":"NDlib: a Python Library to Model and Analyze Diffusion Processes Over\n Complex Networks","abstract":" Nowadays the analysis of dynamics of and on networks represents a hot topic\nin the Social Network Analysis playground. To support students, teachers,\ndevelopers and researchers in this work we introduce a novel framework, namely\nNDlib, an environment designed to describe diffusion simulations. NDlib is\ndesigned to be a multi-level ecosystem that can be fruitfully used by different\nuser segments. For this reason, upon NDlib, we designed a simulation server\nthat allows remote execution of experiments as well as an online visualization\ntool that abstracts its programmatic interface and makes available the\nsimulation platform to non-technicians.\n","authors":"Giulio Rossetti|Letizia Milli|Salvatore Rinzivillo|Alina Sirbu|Fosca Giannotti|Dino Pedreschi","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.05854v1","link_pdf":"http://arxiv.org/pdf/1801.05854v1","link_doi":"http://dx.doi.org/10.1007/s41060-017-0086-6","comment":"","journal_ref":"International Journal of Data Science and Analytics, 2018","doi":"10.1007/s41060-017-0086-6","primary_category":"cs.SI","categories":"cs.SI|05C85, 60J60, 90C35|G.2.2; F.2.1"} {"id":"1712.06346v1","submitted":"2017-12-18 11:32:06","updated":"2017-12-18 11:32:06","title":"Can co-location be used as a proxy for face-to-face contacts?","abstract":" Technological advances have led to a strong increase in the number of data\ncollection efforts aimed at measuring co-presence of individuals at different\nspatial resolutions. It is however unclear how much co-presence data can inform\nus on actual face-to-face contacts, of particular interest to study the\nstructure of a population in social groups or for use in data-driven models of\ninformation or epidemic spreading processes. 
Here, we address this issue by\nleveraging data sets containing high resolution face-to-face contacts as well\nas a coarser spatial localisation of individuals, both temporally resolved, in\nvarious contexts. The co-presence and the face-to-face contact temporal\nnetworks share a number of structural and statistical features, but the former\nis (by definition) much denser than the latter. We thus consider several\ndown-sampling methods that generate surrogate contact networks from the\nco-presence signal and compare them with the real face-to-face data. We show\nthat these surrogate networks reproduce some features of the real data but are\nonly partially able to identify the most central nodes of the face-to-face\nnetwork. We then address the issue of using such down-sampled co-presence data\nin data-driven simulations of epidemic processes, and in identifying efficient\ncontainment strategies. We show that the performance of the various sampling\nmethods strongly varies depending on context. We discuss the consequences of\nour results with respect to data collection strategies and methodologies.\n","authors":"Mathieu Génois|Alain Barrat","affiliations":"","link_abstract":"http://arxiv.org/abs/1712.06346v1","link_pdf":"http://arxiv.org/pdf/1712.06346v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-018-0140-1","comment":"","journal_ref":"EPJ Data Science 7:11 (2018)","doi":"10.1140/epjds/s13688-018-0140-1","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.SI|q-bio.PE"} {"id":"1712.07349v1","submitted":"2017-12-20 07:41:45","updated":"2017-12-20 07:41:45","title":"Data Science: A Three Ring Circus or a Big Tent?","abstract":" This is part of a collection of discussion pieces on David Donoho's paper 50\nYears of Data Science, appearing in Volume 26, Issue 4 of the Journal of\nComputational and Graphical Statistics (2017).\n","authors":"Jennifer Bryan|Hadley Wickham","affiliations":"","link_abstract":"http://arxiv.org/abs/1712.07349v1","link_pdf":"http://arxiv.org/pdf/1712.07349v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT"} {"id":"1712.07420v2","submitted":"2017-12-20 11:24:50","updated":"2018-07-23 13:57:50","title":"Finding Competitive Network Architectures Within a Day Using UCT","abstract":" The design of neural network architectures for a new data set is a laborious\ntask which requires human deep learning expertise. In order to make deep\nlearning available for a broader audience, automated methods for finding a\nneural network architecture are vital. Recently proposed methods can already\nachieve human expert level performances. However, these methods have run times\nof months or even years of GPU computing time, ignoring hardware constraints as\nfaced by many researchers and companies. We propose the use of Monte Carlo\nplanning in combination with two different UCT (upper confidence bound applied\nto trees) derivations to search for network architectures. We adapt the UCT\nalgorithm to the needs of network architecture search by proposing two ways of\nsharing information between different branches of the search tree. 
In an\nempirical study we are able to demonstrate that this method is able to find\ncompetitive networks for MNIST, SVHN and CIFAR-10 in just a single GPU day.\nExtending the search time to five GPU days, we are able to outperform human\narchitectures and our competitors which consider the same types of layers.\n","authors":"Martin Wistuba","affiliations":"","link_abstract":"http://arxiv.org/abs/1712.07420v2","link_pdf":"http://arxiv.org/pdf/1712.07420v2","link_doi":"http://dx.doi.org/10.1109/DSAA.2018.00037","comment":"","journal_ref":"Proceedings of the 5th IEEE International Conference on Data\n Science and Advanced Analytics, pages 263-272, 2018","doi":"10.1109/DSAA.2018.00037","primary_category":"cs.LG","categories":"cs.LG|cs.CV|stat.ML"} {"id":"1801.00253v2","submitted":"2017-12-31 08:43:33","updated":"2018-08-06 12:45:49","title":"Global Income Inequality and Savings: A Data Science Perspective","abstract":" A society or country with income equally distributed among its people is\ntruly a fiction! The phenomena of socioeconomic inequalities have been plaguing\nmankind from times immemorial. We are interested in gaining an insight about\nthe co-evolution of the countries in the inequality space, from a data science\nperspective. For this purpose, we use the time series data for Gini indices of\ndifferent countries, and construct the equal-time cross-correlation matrix. We\nthen use this to construct a similarity matrix and generate a map with the\ncountries as different points generated through a multi-dimensional scaling\ntechnique. We also produce a similar map of different countries using the time\nseries data for Gross Domestic Savings (% of GDP). We also pose a different,\nyet significant, question: Can higher savings moderate the income inequality?\nIn this paper, we have tried to address this question through another data\nscience technique - linear regression, to seek an empirical linkage between the\nincome inequality and savings, mainly for relatively small or closed economies.\nThis question was inspired from an existing theoretical model proposed by\nChakraborti-Chakrabarti (2000), based on the principle of kinetic theory of\ngases. We tested our model empirically using Gini index and Gross Domestic\nSavings, and observed that the model holds reasonably true for many economies\nof the world.\n","authors":"Kiran Sharma|Subhradeep Das|Anirban Chakraborti","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.00253v2","link_pdf":"http://arxiv.org/pdf/1801.00253v2","link_doi":"","comment":"8 pages, 6 figures. IEEE format. Accepted for publication in 5th IEEE\n DSAA 2018 conference at Torino, Italy","journal_ref":"","doi":"","primary_category":"q-fin.GN","categories":"q-fin.GN|physics.soc-ph"} {"id":"1801.00371v2","submitted":"2017-12-31 23:00:24","updated":"2018-05-01 02:55:32","title":"Data Science vs. Statistics: Two Cultures?","abstract":" Data science is the business of learning from data, which is traditionally\nthe business of statistics. Data science, however, is often understood as a\nbroader, task-driven and computationally-oriented version of statistics. Both\nthe term data science and the broader idea it conveys have origins in\nstatistics and are a reaction to a narrower view of data analysis. Expanding\nupon the views of a number of statisticians, this paper encourages a big-tent\nview of data analysis. We examine how evolving approaches to modern data\nanalysis relate to the existing discipline of statistics (e.g. 
exploratory\nanalysis, machine learning, reproducibility, computation, communication and the\nrole of theory). Finally, we discuss what these trends mean for the future of\nstatistics by highlighting promising directions for communication, education\nand research.\n","authors":"Iain Carmichael|J. S. Marron","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.00371v2","link_pdf":"http://arxiv.org/pdf/1801.00371v2","link_doi":"http://dx.doi.org/10.1007/s42081-018-0009-3","comment":"","journal_ref":"","doi":"10.1007/s42081-018-0009-3","primary_category":"stat.OT","categories":"stat.OT"} {"id":"1801.00753v3","submitted":"2018-01-02 18:08:49","updated":"2019-05-07 14:30:27","title":"Probabilistic supervised learning","abstract":" Predictive modelling and supervised learning are central to modern data\nscience. With predictions from an ever-expanding number of supervised black-box\nstrategies - e.g., kernel methods, random forests, deep learning aka neural\nnetworks - being employed as a basis for decision making processes, it is\ncrucial to understand the statistical uncertainty associated with these\npredictions.\n As a general means to approach the issue, we present an overarching framework\nfor black-box prediction strategies that not only predict the target but also\ntheir own predictions' uncertainty. Moreover, the framework allows for fair\nassessment and comparison of disparate prediction strategies. For this, we\nformally consider strategies capable of predicting full distributions from\nfeature variables, so-called probabilistic supervised learning strategies.\n Our work draws from prior work including Bayesian statistics, information\ntheory, and modern supervised machine learning, and in a novel synthesis leads\nto (a) new theoretical insights such as a probabilistic bias-variance\ndecomposition and an entropic formulation of prediction, as well as to (b) new\nalgorithms and meta-algorithms, such as composite prediction strategies,\nprobabilistic boosting and bagging, and a probabilistic predictive independence\ntest.\n Our black-box formulation also leads (c) to a new modular interface view on\nprobabilistic supervised learning and a modelling workflow API design, which we\nhave implemented in the newly released skpro machine learning toolbox,\nextending the familiar modelling interface and meta-modelling functionality of\nsklearn. The skpro package provides interfaces for construction, composition,\nand tuning of probabilistic supervised learning strategies, together with\norchestration features for validation and comparison of any such strategy - be\nit frequentist, Bayesian, or other.\n","authors":"Frithjof Gressmann|Franz J. Király|Bilal Mateen|Harald Oberhauser","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.00753v3","link_pdf":"http://arxiv.org/pdf/1801.00753v3","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG|math.ST|stat.ME|stat.TH"} {"id":"1801.02961v2","submitted":"2018-01-06 23:17:24","updated":"2019-09-29 14:01:32","title":"Representation Learning with Autoencoders for Electronic Health Records:\n A Comparative Study","abstract":" Increasing volume of Electronic Health Records (EHR) in recent years provides\ngreat opportunities for data scientists to collaborate on different aspects of\nhealthcare research by applying advanced analytics to these EHR clinical data.\nA key requirement however is obtaining meaningful insights from high\ndimensional, sparse and complex clinical data. 
Data science approaches\ntypically address this challenge by performing feature learning in order to\nbuild more reliable and informative feature representations from clinical data\nfollowed by supervised learning. In this paper, we propose a predictive\nmodeling approach based on deep learning based feature representations and word\nembedding techniques. Our method uses different deep architectures (stacked\nsparse autoencoders, deep belief network, adversarial autoencoders and\nvariational autoencoders) for feature representation in higher-level\nabstraction to obtain effective and robust features from EHRs, and then build\nprediction models on top of them. Our approach is particularly useful when the\nunlabeled data is abundant whereas labeled data is scarce. We investigate the\nperformance of representation learning through a supervised learning approach.\nOur focus is to present a comparative study to evaluate the performance of\ndifferent deep architectures through supervised learning and provide insights\nin the choice of deep feature representation techniques. Our experiments\ndemonstrate that for small data sets, stacked sparse autoencoder demonstrates a\nsuperior generality performance in prediction due to sparsity regularization\nwhereas variational autoencoders outperform the competing approaches for large\ndata sets due to its capability of learning the representation distribution.\n","authors":"Najibesadat Sadati|Milad Zafar Nezhad|Ratna Babu Chinnam|Dongxiao Zhu","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.02961v2","link_pdf":"http://arxiv.org/pdf/1801.02961v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1801.02874v1","submitted":"2018-01-09 10:47:09","updated":"2018-01-09 10:47:09","title":"Are the different layers of a social network conveying the same\n information?","abstract":" Comprehensive and quantitative investigations of social theories and\nphenomena increasingly benefit from the vast breadth of data describing human\nsocial relations, which is now available within the realm of computational\nsocial science. Such data are, however, typically proxies for one of the many\ninteraction layers composing social networks, which can be defined in many ways\nand are typically composed of communication of various types (e.g., phone\ncalls, face-to-face communication, etc.). As a result, many studies focus on\none single layer, corresponding to the data at hand. Several studies have,\nhowever, shown that these layers are not interchangeable, despite the presence\nof a certain level of correlations between them. Here, we investigate whether\ndifferent layers of interactions among individuals lead to similar conclusions\nwith respect to the presence of homophily patterns in a population---homophily\nrepresents one of the widest studied phenomenon in social networks. To this\naim, we consider a dataset describing interactions and links of various nature\nin a population of Asian students with diverse nationalities, first language\nand gender. We study homophily patterns, as well as their temporal evolutions\nin each layer of the social network. To facilitate our analysis, we put forward\na general method to assess whether the homophily patterns observed in one layer\ninform us about patterns in another layer. 
For instance, our study reveals that\nthree network layers---cell phone communications, questionnaires about\nfriendship, and trust relations---lead to similar and consistent results\ndespite some minor discrepancies. The homophily patterns of the co-presence\nnetwork layer, however, does not yield any meaningful information about other\nnetwork layers.\n","authors":"Ajaykumar Manivannan|W. Quin Yow|Roland Bouffanais|Alain Barrat","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.02874v1","link_pdf":"http://arxiv.org/pdf/1801.02874v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-018-0161-9","comment":"","journal_ref":"EPJ Data Science 7:34 (2018)","doi":"10.1140/epjds/s13688-018-0161-9","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.SI"} {"id":"1801.05098v1","submitted":"2018-01-16 02:21:13","updated":"2018-01-16 02:21:13","title":"IAU WG, Data-driven Astronomy Education and Public Outreach,current\n status and working plans","abstract":" IAU Inter-Commission B2-C1-C2 WG Data-driven Astronomy Education and Public\nOutreach (DAEPO) was launched officially in April 2017. With the development of\nmany mega-science astronomical projects, for example CTA, DESI, EUCLID, FAST,\nGAIA, JWST, LAMOST, LSST, SDSS, SKA, and large scale simulations, astronomy has\nbecome a Big Data science. Astronomical data is not only necessary resource for\nscientific research, but also very valuable resource for education and public\noutreach (EPO), especially in the era of Internet and Cloud Computing. IAU WG\nData-driven Astronomy Education and Public Outreach is hosted at the IAU\nDivision B (Facilities, Technologies and Data Science) Commission B2 (Data and\nDocumentation), and organized jointly with Commission C1 (Astronomy Education\nand Development), Commission C2 (Communicating Astronomy with the Public),\nOffice of Astronomy for Development (OAD), Office for Astronomy Outreach (OAO)\nand several other non IAU communities, including IVOA Education Interest Group,\nAmerican Astronomical Society Worldwide Telescope Advisory Board, Zooniverse\nproject and International Planetarium Society. The working group has the major\nobjectives to: Act as a forum to discuss the value of astronomy data in EPO,\nthe advantages and benefits of data driven EPO, and the challenges facing to\ndata driven EPO; Provide guidelines, curriculum, data resources, tools, and\ne-infrastructure for data driven EPO; Provide best practices of data driven\nEPO. In the paper, backgrounds, current status and working plans in the future\nare introduced. More information about the WG is available at:\nhttp://daepo.china-vo.org/\n","authors":"Chenzhou Cui|Shanshan Li","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.05098v1","link_pdf":"http://arxiv.org/pdf/1801.05098v1","link_doi":"","comment":"4 pages, presented at the Astronomical Data Analysis Software and\n Systems (ADASS) XXVII conference, Santiago, Chile, October 2017","journal_ref":"","doi":"","primary_category":"astro-ph.IM","categories":"astro-ph.IM"} {"id":"1801.05372v4","submitted":"2018-01-16 17:18:29","updated":"2019-06-15 05:25:25","title":"Neural Feature Learning From Relational Database","abstract":" Feature engineering is one of the most important but most tedious tasks in\ndata science. This work studies automation of feature learning from relational\ndatabase. We first prove theoretically that finding the optimal features from\nrelational data for predictive tasks is NP-hard. 
We propose an efficient\nrule-based approach based on heuristics and a deep neural network to\nautomatically learn appropriate features from relational data. We benchmark our\napproaches in ensembles in past Kaggle competitions. Our new approach wins late\nmedals and beats the state-of-the-art solutions with significant margins. To\nthe best of our knowledge, this is the first time an automated data science\nsystem could win medals in Kaggle competitions with complex relational\ndatabase.\n","authors":"Hoang Thanh Lam|Tran Ngoc Minh|Mathieu Sinn|Beat Buesser|Martin Wistuba","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.05372v4","link_pdf":"http://arxiv.org/pdf/1801.05372v4","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI|cs.LG"} {"id":"1801.05627v2","submitted":"2018-01-17 11:48:18","updated":"2018-04-03 09:06:42","title":"On the Reduction of Biases in Big Data Sets for the Detection of\n Irregular Power Usage","abstract":" In machine learning, a bias occurs whenever training sets are not\nrepresentative for the test data, which results in unreliable models. The most\ncommon biases in data are arguably class imbalance and covariate shift. In this\nwork, we aim to shed light on this topic in order to increase the overall\nattention to this issue in the field of machine learning. We propose a scalable\nnovel framework for reducing multiple biases in high-dimensional data sets in\norder to train more reliable predictors. We apply our methodology to the\ndetection of irregular power usage from real, noisy industrial data. In\nemerging markets, irregular power usage, and electricity theft in particular,\nmay range up to 40% of the total electricity distributed. Biased data sets are\nof particular issue in this domain. We show that reducing these biases\nincreases the accuracy of the trained predictors. Our models have the potential\nto generate significant economic value in a real world application, as they are\nbeing deployed in a commercial software for the detection of irregular power\nusage.\n","authors":"Patrick Glauner|Radu State|Petko Valtchev|Diogo Duarte","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.05627v2","link_pdf":"http://arxiv.org/pdf/1801.05627v2","link_doi":"","comment":"","journal_ref":"Proceedings of the 13th International FLINS Conference on Data\n Science and Knowledge Engineering for Sensing Decision Support (FLINS 2018)","doi":"","primary_category":"cs.LG","categories":"cs.LG"} {"id":"1801.05935v1","submitted":"2018-01-18 04:50:42","updated":"2018-01-18 04:50:42","title":"Computation of the Maximum Likelihood estimator in low-rank Factor\n Analysis","abstract":" Factor analysis, a classical multivariate statistical technique is popularly\nused as a fundamental tool for dimensionality reduction in statistics,\neconometrics and data science. Estimation is often carried out via the Maximum\nLikelihood (ML) principle, which seeks to maximize the likelihood under the\nassumption that the positive definite covariance matrix can be decomposed as\nthe sum of a low rank positive semidefinite matrix and a diagonal matrix with\nnonnegative entries. This leads to a challenging rank constrained nonconvex\noptimization problem. We reformulate the low rank ML Factor Analysis problem as\na nonlinear nonsmooth semidefinite optimization problem, study various\nstructural properties of this reformulation and propose fast and scalable\nalgorithms based on difference of convex (DC) optimization. 
Our approach has\ncomputational guarantees, gracefully scales to large problems, is applicable to\nsituations where the sample covariance matrix is rank deficient and adapts to\nvariants of the ML problem with additional constraints on the problem\nparameters. Our numerical experiments demonstrate the significant usefulness of\nour approach over existing state-of-the-art approaches.\n","authors":"Koulik Khamaru|Rahul Mazumder","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.05935v1","link_pdf":"http://arxiv.org/pdf/1801.05935v1","link_doi":"","comment":"22 pages, 4 figures","journal_ref":"","doi":"","primary_category":"math.OC","categories":"math.OC|stat.CO|stat.ML"} {"id":"1801.06814v1","submitted":"2018-01-21 13:20:07","updated":"2018-01-21 13:20:07","title":"Curriculum Guidelines for Undergraduate Programs in Data Science","abstract":" The Park City Math Institute (PCMI) 2016 Summer Undergraduate Faculty Program\nmet for the purpose of composing guidelines for undergraduate programs in Data\nScience. The group consisted of 25 undergraduate faculty from a variety of\ninstitutions in the U.S., primarily from the disciplines of mathematics,\nstatistics and computer science. These guidelines are meant to provide some\nstructure for institutions planning for or revising a major in Data Science.\n","authors":"Richard De Veaux|Mahesh Agarwal|Maia Averett|Benjamin Baumer|Andrew Bray|Thomas Bressoud|Lance Bryant|Lei Cheng|Amanda Francis|Robert Gould|Albert Y. Kim|Matt Kretchmar|Qin Lu|Ann Moskol|Deborah Nolan|Roberto Pelayo|Sean Raleigh|Ricky J. Sethi|Mutiara Sondjaja|Neelesh Tiruviluamala|Paul Uhlig|Talitha Washington|Curtis Wesley|David White|Ping Ye","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.06814v1","link_pdf":"http://arxiv.org/pdf/1801.06814v1","link_doi":"http://dx.doi.org/10.1146/annurev-statistics-060116-053930","comment":"","journal_ref":"Annual Review of Statistics, Volume 4 (2017), 15-30","doi":"10.1146/annurev-statistics-060116-053930","primary_category":"stat.OT","categories":"stat.OT"} {"id":"1801.07413v1","submitted":"2018-01-23 07:15:15","updated":"2018-01-23 07:15:15","title":"Greed is Still Good: Maximizing Monotone Submodular+Supermodular\n Functions","abstract":" We analyze the performance of the greedy algorithm, and also a discrete\nsemi-gradient based algorithm, for maximizing the sum of a suBmodular and\nsuPermodular (BP) function (both of which are non-negative monotone\nnon-decreasing) under two types of constraints, either a cardinality constraint\nor $p\\geq 1$ matroid independence constraints. These problems occur naturally\nin several real-world applications in data science, machine learning, and\nartificial intelligence. The problems are ordinarily inapproximable to any\nfactor (as we show). Using the curvature $\\kappa_f$ of the submodular term, and\nintroducing $\\kappa^g$ for the supermodular term (a natural dual curvature for\nsupermodular functions), however, both of which are computable in linear time,\nwe show that BP maximization can be efficiently approximated by both the greedy\nand the semi-gradient based algorithm. The algorithms yield multiplicative\nguarantees of $\\frac{1}{\\kappa_f}\\left[1-e^{-(1-\\kappa^g)\\kappa_f}\\right]$ and\n$\\frac{1-\\kappa^g}{(1-\\kappa^g)\\kappa_f + p}$ for the two types of constraints\nrespectively. For pure monotone supermodular constrained maximization, these\nyield $1-\\kappa^g$ and $(1-\\kappa^g)/p$ for the two types of constraints\nrespectively. 
We also analyze the hardness of BP maximization and show that our\nguarantees match hardness by a constant factor and by $O(\ln(p))$ respectively.\nComputational experiments are also provided supporting our analysis.\n","authors":"Wenruo Bai|Jeffrey A. Bilmes","affiliations":"","link_abstract":"http://arxiv.org/abs/1801.07413v1","link_pdf":"http://arxiv.org/pdf/1801.07413v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DM","categories":"cs.DM"} {"id":"1802.00565v2","submitted":"2018-02-02 05:45:21","updated":"2018-02-09 17:14:09","title":"Detecting Zones and Threat on 3D Body for Security in Airports using\n Deep Machine Learning","abstract":" In this research, a segmentation and classification method was used to\nidentify threats in human scanner images from airport security. The Department\nof Homeland Security (DHS) in the USA has a high false alarm rate, produced by\nthe algorithms used with today's scanners at the airports. To address this\nproblem they started a new competition at the Kaggle site asking the science\ncommunity to improve their detection with new algorithms. The dataset\nused in this research comes from DHS at\nhttps://www.kaggle.com/c/passenger-screening-algorithm-challenge/data According\nto DHS: \"This dataset contains a large number of body scans acquired by a new\ngeneration of millimeter wave scanner called the High Definition-Advanced\nImaging Technology (HD-AIT) system. They are comprised of volunteers wearing\ndifferent clothing types (from light summer clothes to heavy winter clothes),\ndifferent body mass indices, different genders, different numbers of threats,\nand different types of threats\". Using Python as the principal language, the\npreprocessing of the dataset images extracted features from 200 bodies using\nintensity, intensity differences and local neighbourhood, to produce\nsegmentation regions and to label those regions for use as ground truth in a\ntraining and test dataset. The regions are subsequently given to a CNN deep\nlearning classifier to predict 17 classes (representing the body zones):\nzone1, zone2, ... zone17 and zones with threats, for a total of 34 zones. The\nanalysis showed that the classifier achieved an accuracy of 98.2863% and a\nloss of 0.091319, as well as an average of 100% for recall and precision.\n","authors":"Abel Ag Rb Guimaraes|Ghassem Tofighi","affiliations":"","link_abstract":"http://arxiv.org/abs/1802.00565v2","link_pdf":"http://arxiv.org/pdf/1802.00565v2","link_doi":"http://dx.doi.org/10.5281/zenodo.1189345","comment":"7 pages, 17 figures, This article was accepted from the Star\n Conference, Data Science and Big Data Analyses MAY 24-25, 2018 | Toronto,\n Canada","journal_ref":"","doi":"10.5281/zenodo.1189345","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1802.04253v2","submitted":"2018-02-11 00:24:32","updated":"2018-05-23 04:05:40","title":"Global Model Interpretation via Recursive Partitioning","abstract":" In this work, we propose a simple but effective method to interpret black-box\nmachine learning models globally. That is, we use a compact binary tree, the\ninterpretation tree, to explicitly represent the most important decision rules\nthat are implicitly contained in the black-box machine learning models. This\ntree is learned from the contribution matrix which consists of the\ncontributions of input variables to predicted scores for each single\nprediction. 
To generate the interpretation tree, a unified process recursively\npartitions the input variable space by maximizing the difference in the average\ncontribution of the split variable between the divided spaces. We demonstrate\nthe effectiveness of our method in diagnosing machine learning models on\nmultiple tasks. Also, it is useful for new knowledge discovery as such insights\nare not easily identifiable when only looking at single predictions. In\ngeneral, our work makes it easier and more efficient for human beings to\nunderstand machine learning models.\n","authors":"Chengliang Yang|Anand Rangarajan|Sanjay Ranka","affiliations":"","link_abstract":"http://arxiv.org/abs/1802.04253v2","link_pdf":"http://arxiv.org/pdf/1802.04253v2","link_doi":"","comment":"Accepted by The 4th IEEE International Conference on Data Science and\n Systems (DSS-2018)","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.AI|stat.ML"} {"id":"1802.05982v1","submitted":"2018-02-15 10:54:31","updated":"2018-02-15 10:54:31","title":"Residual-Based Detections and Unified Architecture for Massive MIMO\n Uplink","abstract":" Massive multiple-input multiple-output (M-MIMO) technique brings better\nenergy efficiency and coverage but higher computational complexity than\nsmall-scale MIMO. For linear detections such as minimum mean square error\n(MMSE), prohibitive complexity lies in solving large-scale linear equations.\nFor a better trade-off between bit-error-rate (BER) performance and\ncomputational complexity, iterative linear algorithms like conjugate gradient\n(CG) have been applied and have shown their feasibility in recent years. In\nthis paper, residual-based detection (RBD) algorithms are proposed for M-MIMO\ndetection, including minimal residual (MINRES) algorithm, generalized minimal\nresidual (GMRES) algorithm, and conjugate residual (CR) algorithm. RBD\nalgorithms focus on the minimization of residual norm per iteration, whereas\nmost existing algorithms focus on the approximation of exact signal. Numerical\nresults have shown that, for $64$-QAM $128\\times 8$ MIMO, RBD algorithms are\nonly $0.13$ dB away from the exact matrix inversion method when BER$=10^{-4}$.\nStability of RBD algorithms has also been verified in various correlation\nconditions. Complexity comparison has shown that, CR algorithm require $87\\%$\nless complexity than the traditional method for $128\\times 60$ MIMO. 
The\nunified hardware architecture is proposed with flexibility, which guarantees a\nlow-complexity implementation for a family of RBD M-MIMO detectors.\n","authors":"Chuan Zhang|Yufeng Yang|Shunqing Zhang|Zaichen Zhang|Xiaohu You","affiliations":"Lab of Efficient Architectures for Digital-communication and Signal-processing|Lab of Efficient Architectures for Digital-communication and Signal-processing|Shanghai Institute for Advanced Communications and Data Science, Shanghai University, Shanghai, China|National Mobile Communications Research Laboratory|National Mobile Communications Research Laboratory","link_abstract":"http://arxiv.org/abs/1802.05982v1","link_pdf":"http://arxiv.org/pdf/1802.05982v1","link_doi":"","comment":"submitted to Journal of Signal Processing Systems","journal_ref":"","doi":"","primary_category":"eess.SP","categories":"eess.SP|cs.AR|cs.CE|cs.NA"} {"id":"1802.05792v2","submitted":"2018-02-15 23:24:39","updated":"2019-04-28 09:53:24","title":"Masked Conditional Neural Networks for Automatic Sound Events\n Recognition","abstract":" Deep neural network architectures designed for application domains other than\nsound, especially image recognition, may not optimally harness the\ntime-frequency representation when adapted to the sound recognition problem. In\nthis work, we explore the ConditionaL Neural Network (CLNN) and the Masked\nConditionaL Neural Network (MCLNN) for multi-dimensional temporal signal\nrecognition. The CLNN considers the inter-frame relationship, and the MCLNN\nenforces a systematic sparseness over the network's links to enable learning in\nfrequency bands rather than bins allowing the network to be frequency shift\ninvariant mimicking a filterbank. The mask also allows considering several\ncombinations of features concurrently, which is usually handcrafted through\nexhaustive manual search. We applied the MCLNN to the environmental sound\nrecognition problem using the ESC-10 and ESC-50 datasets. MCLNN achieved\ncompetitive performance, using 12% of the parameters and without augmentation,\ncompared to state-of-the-art Convolutional Neural Networks.\n","authors":"Fady Medhat|David Chesmore|John Robinson","affiliations":"","link_abstract":"http://arxiv.org/abs/1802.05792v2","link_pdf":"http://arxiv.org/pdf/1802.05792v2","link_doi":"http://dx.doi.org/10.1109/DSAA.2017.43","comment":"Restricted Boltzmann Machine, RBM, Conditional RBM, CRBM, Deep Belief\n Net, DBN, Conditional Neural Network, CLNN, Masked Conditional Neural\n Network, MCLNN, Environmental Sound Recognition, ESR","journal_ref":"IEEE International Conference on Data Science and Advanced\n Analytics (DSAA) Year: 2017, Pages: 389 - 394","doi":"10.1109/DSAA.2017.43","primary_category":"cs.LG","categories":"cs.LG|cs.SD|eess.AS|stat.ML"} {"id":"1802.10444v1","submitted":"2018-02-21 08:51:48","updated":"2018-02-21 08:51:48","title":"On the Low-Complexity, Hardware-Friendly Tridiagonal Matrix Inversion\n for Correlated Massive MIMO Systems","abstract":" In massive MIMO (M-MIMO) systems, one of the key challenges in the\nimplementation is the large-scale matrix inversion operation, as widely used in\nchannel estimation, equalization, detection, and decoding procedures.\nTraditionally, to handle this complexity issue, several low-complexity matrix\ninversion approximation methods have been proposed, including the classic\nCholesky decomposition and the Neumann series expansion (NSE). 
However, the\nconventional approaches fail to exploit the special structure of the channel\nmatrices and to address the critical issues in the hardware implementation,\nwhich results in poorer throughput performance and longer processing delay. In\nthis paper, by targeting correlated M-MIMO systems, we propose a modified NSE\nbased on tridiagonal matrix inversion approximation (TMA) to accommodate both\nthe complexity and the performance issues of the conventional hardware\nimplementation, and analyze the corresponding approximation errors. Meanwhile,\nwe investigate the VLSI implementation for the proposed detection algorithm\nbased on a Xilinx Virtex-7 XC7VX690T FPGA platform. It is shown that for\ncorrelated massive MIMO systems, it can achieve near-MMSE performance and $630$\nMb/s throughput. Compared with other benchmark systems, the proposed pipelined\nTMA detector achieves a high throughput-to-hardware ratio. Finally, we also\npropose a fast iteration structure for further research.\n","authors":"Chuan Zhang|Xiao Liang|Zhizhen Wu|Feng Wang|Shunqing Zhang|Zaichen Zhang|Xiaohu You","affiliations":"Lab of Efficient Architectures for Digital-communication and Signal-processing|Lab of Efficient Architectures for Digital-communication and Signal-processing|Lab of Efficient Architectures for Digital-communication and Signal-processing|Lab of Efficient Architectures for Digital-communication and Signal-processing|Shanghai Institute for Advanced Communications and Data Science, Shanghai University, Shanghai, China|National Mobile Communications Research Laboratory|National Mobile Communications Research Laboratory","link_abstract":"http://arxiv.org/abs/1802.10444v1","link_pdf":"http://arxiv.org/pdf/1802.10444v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"eess.SP","categories":"eess.SP|cs.AR|cs.NA|math.NA"} {"id":"1802.08363v2","submitted":"2018-02-23 02:24:14","updated":"2018-09-08 13:15:48","title":"An efficient $k$-means-type algorithm for clustering datasets with\n incomplete records","abstract":" The $k$-means algorithm is arguably the most popular nonparametric clustering\nmethod but cannot generally be applied to datasets with incomplete records. The\nusual practice then is to either impute missing values under an assumed\nmissing-completely-at-random mechanism or to ignore the incomplete records, and\napply the algorithm on the resulting dataset. We develop an efficient version\nof the $k$-means algorithm that allows for clustering in the presence of\nincomplete records. Our extension is called $k_m$-means and reduces to the\n$k$-means algorithm when all records are complete. We also provide\ninitialization strategies for our algorithm and methods to estimate the number\nof groups in the dataset. 
Illustrations and simulations demonstrate the\nefficacy of our approach in a variety of settings and patterns of missing data.\nOur methods are also applied to the analysis of activation images obtained from\na functional Magnetic Resonance Imaging experiment.\n","authors":"Andrew Lithio|Ranjan Maitra","affiliations":"","link_abstract":"http://arxiv.org/abs/1802.08363v2","link_pdf":"http://arxiv.org/pdf/1802.08363v2","link_doi":"http://dx.doi.org/10.1002/sam.11392","comment":"21 pages, 12 figures, 3 tables, in press, Statistical Analysis and\n Data Mining -- The ASA Data Science Journal, 2018","journal_ref":"","doi":"10.1002/sam.11392","primary_category":"stat.ML","categories":"stat.ML|astro-ph.HE|cs.LG|stat.CO|stat.ME"} {"id":"1802.08858v1","submitted":"2018-02-24 14:54:19","updated":"2018-02-24 14:54:19","title":"A Project Based Approach to Statistics and Data Science","abstract":" In an increasingly data-driven world, facility with statistics is more\nimportant than ever for our students. At institutions without a statistician,\nit often falls to the mathematics faculty to teach statistics courses. This\npaper presents a model that a mathematician asked to teach statistics can\nfollow. This model entails connecting with faculty from numerous departments on\ncampus to develop a list of topics, building a repository of real-world\ndatasets from these faculty, and creating projects where students interface\nwith these datasets to write lab reports aimed at consumers of statistics in\nother disciplines. The end result is students who are well prepared for\ninterdisciplinary research, who are accustomed to coping with the\nidiosyncrasies of real data, and who have sharpened their technical writing and\nspeaking skills.\n","authors":"David White","affiliations":"","link_abstract":"http://arxiv.org/abs/1802.08858v1","link_pdf":"http://arxiv.org/pdf/1802.08858v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT"} {"id":"1803.00567v4","submitted":"2018-03-01 18:28:43","updated":"2020-03-18 09:54:55","title":"Computational Optimal Transport","abstract":" Optimal transport (OT) theory can be informally described using the words of\nthe French mathematician Gaspard Monge (1746-1818): A worker with a shovel in\nhand has to move a large pile of sand lying on a construction site. The goal of\nthe worker is to erect with all that sand a target pile with a prescribed shape\n(for example, that of a giant sand castle). Naturally, the worker wishes to\nminimize her total effort, quantified for instance as the total distance or\ntime spent carrying shovelfuls of sand. Mathematicians interested in OT cast\nthat problem as that of comparing two probability distributions, two different\npiles of sand of the same volume. They consider all of the many possible ways\nto morph, transport or reshape the first pile into the second, and associate a\n\"global\" cost to every such transport, using the \"local\" consideration of how\nmuch it costs to move a grain of sand from one place to another. Recent years\nhave witnessed the spread of OT in several fields, thanks to the emergence of\napproximate solvers that can scale to sizes and dimensions that are relevant to\ndata sciences. Thanks to this newfound scalability, OT is being increasingly\nused to unlock various problems in imaging sciences (such as color or texture\nprocessing), computer vision and graphics (for shape manipulation) or machine\nlearning (for regression, classification and density fitting). 
This short book\nreviews OT with a bias toward numerical methods and their applications in data\nsciences, and sheds lights on the theoretical properties of OT that make it\nparticularly useful for some of these applications.\n","authors":"Gabriel Peyré|Marco Cuturi","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.00567v4","link_pdf":"http://arxiv.org/pdf/1803.00567v4","link_doi":"","comment":"new version with corrected typo in Eq. 4.43 and 4.44 (minus sign in\n front of f, g now changed to +) a few more corrected typos","journal_ref":"Foundations and Trends in Machine Learning, vol. 11, no. 5-6, pp.\n 355-607, 2019","doi":"","primary_category":"stat.ML","categories":"stat.ML"} {"id":"1803.05991v1","submitted":"2018-03-07 05:52:50","updated":"2018-03-07 05:52:50","title":"Big data analytics: The stakes for students, scientists & managers - a\n management perspective","abstract":" For a developing nation, deploying big data (BD) technology and introducing\ndata science in higher education is a challenge. A pessimistic scenario is:\nMis-use of data in many possible ways, waste of trained manpower, poor BD\ncertifications from institutes, under-utilization of resources, disgruntled\nmanagement staff, unhealthy competition in the market, poor integration with\nexisting technical infrastructures. Also, the questions in the minds of\nstudents, scientists, engineers, teachers and managers deserve wider attention.\nBesides the stated perceptions and analyses perhaps ignoring socio-political\nand scientific temperaments in developing nations, the following questions\narise: How did the BD phenomenon naturally occur, post technological\ndevelopments in Computer and Communications Technology and how did different\nexperts react to it? Are academicians elsewhere agreeing on the fact that BD is\na new science? Granted that big data science is a new science what are its\nfoundations as compared to conventional topics in Physics, Chemistry or\nBiology? Or, is it similar to astronomy or nuclear science? What are the\ntechnological and engineering implications and how these can be advantageously\nused to augment business intelligence, for example? Will the industry adopt the\nchanges due to tactical advantages? How can BD success stories be carried over\nelsewhere? How will BD affect the Computer Science and other curricula? How\nwill BD benefit different segments of our society on a large scale? To answer\nthese, an appreciation of the BD as a science and as a technology is necessary.\nThis paper presents a quick BD overview, relying on the contemporary\nliterature; it addresses: characterizations of BD and the BD people, the\nbackground required for the students and teachers to join the BD bandwagon, the\nmanagement challenges in embracing BD.\n","authors":"K. 
Viswanathan Iyer","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.05991v1","link_pdf":"http://arxiv.org/pdf/1803.05991v1","link_doi":"","comment":"Accepted for oral presentation at the forthcoming EeL-2018 conference\n to be held in September 2018 in Singapore","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.DM"} {"id":"1805.04541v2","submitted":"2018-03-07 14:38:04","updated":"2018-09-10 18:17:11","title":"Proceedings of the Workshop on Data Mining for Geophysics and Geology","abstract":" Modern geosciences have to deal with large quantities and a wide variety of\ndata, including 2-D, 3-D and 4-D seismic surveys, well logs generated by\nsensors, detailed lithological records, satellite images and meteorological\nrecords. These data serve important industries, such as the exploration of\nmineral deposits and the production of energy (Oil and Gas, Geothermal, Wind,\nHydroelectric), are important in the study of the earth crust to reduce the\nimpact of earthquakes, in land use planning, and have a fundamental role in\nsustainability.\n The volume of raw data being stored by different earth science archives today\nmakes it impossible to rely on manual examination by scientists. The data\nvolumes resultant of different sources, from terrestrial or aerial to satellite\nsurveys, will reach a terabyte per day by the time all the planned satellites\nare flown. In particular, the oil industry has been using large quantities of\ndata for quite a long time. Although there are published works in this area\nsince the 70s, these days, the ubiquity of computing and sensor devices enables\nthe collection of higher resolution data in real time, giving a new life to a\nmature industrial field. Understanding and finding value in this data has an\nimpact on the efficiency of the operations in the oil and gas production chain.\nEfficiency gains are particularly important since the steep fall in oil prices\nin 2014, and represent an important opportunity for data mining and data\nscience.\n","authors":"Youzuo Lin|Weichang Li|Alipio Jorge|Rui L. Lopes|German Larrazabal|Pablo Guillen","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.04541v2","link_pdf":"http://arxiv.org/pdf/1805.04541v2","link_doi":"","comment":"Hosted at SIAM International Conference on Data Mining (SDM 2018)","journal_ref":"","doi":"","primary_category":"physics.geo-ph","categories":"physics.geo-ph"} {"id":"1803.03104v1","submitted":"2018-03-08 14:31:46","updated":"2018-03-08 14:31:46","title":"Applicability and interpretation of the deterministic weighted cepstral\n distance","abstract":" Quantifying similarity between data objects is an important part of modern\ndata science. Deciding what similarity measure to use is very application\ndependent. In this paper, we combine insights from systems theory and machine\nlearning, and investigate the weighted cepstral distance, which was previously\ndefined for signals coming from ARMA models. We provide an extension of this\ndistance to invertible deterministic linear time invariant single input single\noutput models, and assess its applicability. We show that it can always be\ninterpreted in terms of the poles and zeros of the underlying model, and that,\nin the case of stable, minimum-phase, or unstable, maximum-phase models, a\ngeometrical interpretation in terms of subspace angles can be given. We then\ndevise a method to assess stability and phase-type of the generating models,\nusing only input/output signal information. 
In this way, we prove a connection\nbetween the extended weighted cepstral distance and a weighted cepstral model\nnorm. In this way, we provide a purely data-driven way to assess different\nunderlying dynamics of input/output signal pairs, without the need for any\nsystem identification step. This can be useful in machine learning tasks such\nas time series clustering. An iPython tutorial is published complementary to\nthis paper, containing implementations of the various methods and algorithms\npresented here, as well as some numerical illustrations of the equivalences\nproven here.\n","authors":"Oliver Lauwers|Bart De Moor","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.03104v1","link_pdf":"http://arxiv.org/pdf/1803.03104v1","link_doi":"","comment":"18 pages, 5 figures, submitted for review to Automatica","journal_ref":"","doi":"","primary_category":"cs.SY","categories":"cs.SY|cs.CV|math.DS|stat.ML"} {"id":"1803.03176v1","submitted":"2018-03-08 16:01:32","updated":"2018-03-08 16:01:32","title":"Modeling Activation Processes in Human Memory to Improve Tag\n Recommendations","abstract":" This thesis was submitted by Dr. Dominik Kowald to the Institute of\nInteractive Systems and Data Science of Graz University of Technology in\nAustria on the 5th of September 2017 for the attainment of the degree\n'Dr.techn'. The supervisors of this thesis have been Prof. Stefanie Lindstaedt\nand Ass.Prof. Elisabeth Lex from Graz University of Technology, and the\nexternal assessor has been Prof. Tobias Ley from Tallinn University.\n In the current enthusiasm around Data Science and Big Data Analytics, it is\nimportant to mention that only theory-guided approaches will truly enable us to\nfully understand why an algorithm works and how specific results can be\nexplained. It was the goal of this dissertation research to follow this path by\ndemonstrating that a recommender system inspired by human memory theory can\nhave a true impact in the field.\n","authors":"Dominik Kowald","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.03176v1","link_pdf":"http://arxiv.org/pdf/1803.03176v1","link_doi":"","comment":"Summary of dissertation on recommender systems submitted to Graz\n University of Technology (Austria)","journal_ref":"","doi":"","primary_category":"cs.IR","categories":"cs.IR"} {"id":"1803.04219v1","submitted":"2018-03-12 12:34:12","updated":"2018-03-12 12:34:12","title":"Data Science Methodology for Cybersecurity Projects","abstract":" Cyber-security solutions are traditionally static and signature-based. The\ntraditional solutions along with the use of analytic models, machine learning\nand big data could be improved by automatically trigger mitigation or provide\nrelevant awareness to control or limit consequences of threats. This kind of\nintelligent solutions is covered in the context of Data Science for\nCyber-security. Data Science provides a significant role in cyber-security by\nutilising the power of data (and big data), high-performance computing and data\nmining (and machine learning) to protect users against cyber-crimes. For this\npurpose, a successful data science project requires an effective methodology to\ncover all issues and provide adequate resources. In this paper, we are\nintroducing popular data science methodologies and will compare them in\naccordance with cyber-security challenges. 
A comparison discussion has also been\ndelivered to explain the methodologies' strengths and weaknesses in the case\nof cyber-security projects.\n","authors":"Farhad Foroughi|Peter Luksch","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.04219v1","link_pdf":"http://arxiv.org/pdf/1803.04219v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.CR"} {"id":"1803.06964v4","submitted":"2018-03-19 14:49:27","updated":"2018-06-16 04:06:47","title":"A modern maximum-likelihood theory for high-dimensional logistic\n regression","abstract":" Every student in statistics or data science learns early on that when the\nsample size largely exceeds the number of variables, fitting a logistic model\nproduces estimates that are approximately unbiased. Every student also learns\nthat there are formulas to predict the variability of these estimates which are\nused for the purpose of statistical inference; for instance, to produce\np-values for testing the significance of regression coefficients. Although\nthese formulas come from large sample asymptotics, we are often told that we\nare on reasonably safe grounds when $n$ is large in such a way that $n \ge 5p$\nor $n \ge 10p$. This paper shows that this is far from the case, and\nconsequently, inferences routinely produced by common software packages are\noften unreliable.\n Consider a logistic model with independent features in which $n$ and $p$\nbecome increasingly large in a fixed ratio. Then we show that (1) the MLE is\nbiased, (2) the variability of the MLE is far greater than classically\npredicted, and (3) the commonly used likelihood-ratio test (LRT) is not\ndistributed as a chi-square. The bias of the MLE is extremely problematic as it\nyields completely wrong predictions for the probability of a case based on\nobserved values of the covariates. We develop a new theory, which\nasymptotically predicts (1) the bias of the MLE, (2) the variability of the\nMLE, and (3) the distribution of the LRT. We empirically also demonstrate that\nthese predictions are extremely accurate in finite samples. Further, an\nappealing feature is that these novel predictions depend on the unknown\nsequence of regression coefficients only through a single scalar, the overall\nstrength of the signal. This suggests very concrete procedures to adjust\ninference; we describe one such procedure learning a single parameter from data\nand producing accurate inference.\n","authors":"Pragya Sur|Emmanuel J. Candes","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.06964v4","link_pdf":"http://arxiv.org/pdf/1803.06964v4","link_doi":"","comment":"29 pages, 14 figures, 4 tables","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.ME|stat.TH"} {"id":"1803.06992v1","submitted":"2018-03-19 15:31:41","updated":"2018-03-19 15:31:41","title":"Estimating the intrinsic dimension of datasets by a minimal neighborhood\n information","abstract":" Analyzing large volumes of high-dimensional data is an issue of fundamental\nimportance in data science, molecular simulations and beyond. Several\napproaches work on the assumption that the important content of a dataset\nbelongs to a manifold whose Intrinsic Dimension (ID) is much lower than the\ncrude large number of coordinates. Such a manifold is generally twisted and\ncurved; in addition, points on it will be non-uniformly distributed: two factors\nthat make the identification of the ID and its exploitation really hard. 
Here\nwe propose a new ID estimator using only the distance of the first and the\nsecond nearest neighbor of each point in the sample. This extreme minimality\nenables us to reduce the effects of curvature, of density variation, and the\nresulting computational cost. The ID estimator is theoretically exact in\nuniformly distributed datasets, and provides consistent measures in general.\nWhen used in combination with block analysis, it allows discriminating the\nrelevant dimensions as a function of the block size. This allows estimating the\nID even when the data lie on a manifold perturbed by a high-dimensional noise,\na situation often encountered in real world data sets. We demonstrate the\nusefulness of the approach on molecular simulations and image analysis.\n","authors":"Elena Facco|Maria d'Errico|Alex Rodriguez|Alessandro Laio","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.06992v1","link_pdf":"http://arxiv.org/pdf/1803.06992v1","link_doi":"http://dx.doi.org/10.1038/s41598-017-11873-y","comment":"Scientific Reports 2017","journal_ref":"","doi":"10.1038/s41598-017-11873-y","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1803.07828v2","submitted":"2018-03-21 10:06:28","updated":"2018-11-09 14:26:16","title":"Expeditious Generation of Knowledge Graph Embeddings","abstract":" Knowledge Graph Embedding methods aim at representing entities and relations\nin a knowledge base as points or vectors in a continuous vector space. Several\napproaches using embeddings have shown promising results on tasks such as link\nprediction, entity recommendation, question answering, and triplet\nclassification. However, only a few methods can compute low-dimensional\nembeddings of very large knowledge bases without needing state-of-the-art\ncomputational resources. In this paper, we propose KG2Vec, a simple and fast\napproach to Knowledge Graph Embedding based on the skip-gram model. Instead of\nusing a predefined scoring function, we learn it relying on Long Short-Term\nMemories. We show that our embeddings achieve results comparable with the most\nscalable approaches on knowledge graph completion as well as on a new metric.\nYet, KG2Vec can embed large graphs in lesser time by processing more than 250\nmillion triples in less than 7 hours on common hardware.\n","authors":"Tommaso Soru|Stefano Ruberto|Diego Moussallem|André Valdestilhas|Alexander Bigerl|Edgard Marx|Diego Esteves","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.07828v2","link_pdf":"http://arxiv.org/pdf/1803.07828v2","link_doi":"","comment":"Submitted to the Archives of Data Science, Series A; 14 pages","journal_ref":"","doi":"","primary_category":"cs.CL","categories":"cs.CL|cs.AI|I.2.4; I.2.6"} {"id":"1803.08010v2","submitted":"2018-03-21 16:53:19","updated":"2018-04-04 01:44:22","title":"Social Media Would Not Lie: Prediction of the 2016 Taiwan Election via\n Online Heterogeneous Data","abstract":" The prevalence of online media has attracted researchers from various domains\nto explore human behavior and make interesting predictions. In this research,\nwe leverage heterogeneous social media data collected from various online\nplatforms to predict Taiwan's 2016 presidential election. In contrast to most\nexisting research, we take a \"signal\" view of heterogeneous information and\nadopt the Kalman filter to fuse multiple signals into daily vote predictions\nfor the candidates. 
We also consider events that influenced the election in a\nquantitative manner based on the so-called event study model that originated in\nthe field of financial research. We obtained the following interesting\nfindings. First, public opinions in online media dominate traditional polls in\nTaiwan election prediction in terms of both predictive power and timeliness.\nBut offline polls can still function on alleviating the sample bias of online\nopinions. Second, although online signals converge as election day approaches,\nthe simple Facebook \"Like\" is consistently the strongest indicator of the\nelection result. Third, most influential events have a strong connection to\ncross-strait relations, and the Chou Tzu-yu flag incident followed by the\napology video one day before the election increased the vote share of Tsai\nIng-Wen by 3.66%. This research justifies the predictive power of online media\nin politics and the advantages of information fusion. The combined use of the\nKalman filter and the event study method contributes to the data-driven\npolitical analytics paradigm for both prediction and attribution purposes.\n","authors":"Zheng Xie|Guannan Liu|Junjie Wu|Yong Tan","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.08010v2","link_pdf":"http://arxiv.org/pdf/1803.08010v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-018-0163-7","comment":"","journal_ref":"EPJ Data Science,2018,7:32","doi":"10.1140/epjds/s13688-018-0163-7","primary_category":"cs.SI","categories":"cs.SI|physics.soc-ph|stat.AP|stat.ML"} {"id":"1803.08450v2","submitted":"2018-03-22 16:46:39","updated":"2019-02-13 13:40:05","title":"A Comprehensive Analysis of Deep Regression","abstract":" Deep learning revolutionized data science, and recently its popularity has\ngrown exponentially, as did the amount of papers employing deep networks.\nVision tasks, such as human pose estimation, did not escape from this trend.\nThere is a large number of deep models, where small changes in the network\narchitecture, or in the data pre-processing, together with the stochastic\nnature of the optimization procedures, produce notably different results,\nmaking extremely difficult to sift methods that significantly outperform\nothers. This situation motivates the current study, in which we perform a\nsystematic evaluation and statistical analysis of vanilla deep regression, i.e.\nconvolutional neural networks with a linear regression top layer. This is the\nfirst comprehensive analysis of deep regression techniques. We perform\nexperiments on four vision problems, and report confidence intervals for the\nmedian performance as well as the statistical significance of the results, if\nany. Surprisingly, the variability due to different data pre-processing\nprocedures generally eclipses the variability due to modifications in the\nnetwork architecture. Our results reinforce the hypothesis according to which,\nin general, a general-purpose network (e.g. 
VGG-16 or ResNet-50) adequately\ntuned can yield results close to the state-of-the-art without having to resort\nto more complex and ad-hoc regression models.\n","authors":"Stéphane Lathuilière|Pablo Mesejo|Xavier Alameda-Pineda|Radu Horaud","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.08450v2","link_pdf":"http://arxiv.org/pdf/1803.08450v2","link_doi":"","comment":"submitted to TPAMI","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1803.09875v1","submitted":"2018-03-27 03:23:19","updated":"2018-03-27 03:23:19","title":"A Web Scraping Methodology for Bypassing Twitter API Restrictions","abstract":" Retrieving information from social networks is the first and primordial step\nmany data analysis fields such as Natural Language Processing, Sentiment\nAnalysis and Machine Learning. Important data science tasks relay on historical\ndata gathering for further predictive results. Most of the recent works use\nTwitter API, a public platform for collecting public streams of information,\nwhich allows querying chronological tweets for no more than three weeks old. In\nthis paper, we present a new methodology for collecting historical tweets\nwithin any date range using web scraping techniques bypassing for Twitter API\nrestrictions.\n","authors":"A. Hernandez-Suarez|G. Sanchez-Perez|K. Toscano-Medina|V. Martinez-Hernandez|V. Sanchez|H. Perez-Meana","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.09875v1","link_pdf":"http://arxiv.org/pdf/1803.09875v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.IR","categories":"cs.IR"} {"id":"1803.10045v1","submitted":"2018-03-27 12:46:57","updated":"2018-03-27 12:46:57","title":"Kinetic Compressive Sensing","abstract":" Parametric images provide insight into the spatial distribution of\nphysiological parameters, but they are often extremely noisy, due to low SNR of\ntomographic data. Direct estimation from projections allows accurate noise\nmodeling, improving the results of post-reconstruction fitting. We propose a\nmethod, which we name kinetic compressive sensing (KCS), based on a\nhierarchical Bayesian model and on a novel reconstruction algorithm, that\nencodes sparsity of kinetic parameters. Parametric maps are reconstructed by\nmaximizing the joint probability, with an Iterated Conditional Modes (ICM)\napproach, alternating the optimization of activity time series (OS-MAP-OSL),\nand kinetic parameters (MAP-LM). We evaluated the proposed algorithm on a\nsimulated dynamic phantom: a bias/variance study confirmed how direct estimates\ncan improve the quality of parametric maps over a post-reconstruction fitting,\nand showed how the novel sparsity prior can further reduce their variance,\nwithout affecting bias. Real FDG PET human brain data (Siemens mMR, 40min)\nimages were also processed. Results enforced how the proposed KCS-regularized\ndirect method can produce spatially coherent images and parametric maps, with\nlower spatial noise and better tissue contrast. A GPU-based open source\nimplementation of the algorithm is provided.\n","authors":"Michele Scipioni|Maria F. Santarelli|Luigi Landini|Ciprian Catana|Douglas N. Greve|Julie C. 
Price|Stefano Pedemonte","affiliations":"DII, University of Pisa|IFC-CNR, Pisa|DII, University of Pisa|Martinos Center for Biomedical Imaging, Boston|Martinos Center for Biomedical Imaging, Boston|Martinos Center for Biomedical Imaging, Boston|Martinos Center for Biomedical Imaging, Boston","link_abstract":"http://arxiv.org/abs/1803.10045v1","link_pdf":"http://arxiv.org/pdf/1803.10045v1","link_doi":"","comment":"5 pages, 6 figures, Submitted to the Conference Record of \"IEEE\n Nuclear Science Symposium and Medical Imaging Conference (IEEE NSS-MIC) 2017\"","journal_ref":"","doi":"","primary_category":"physics.med-ph","categories":"physics.med-ph|cs.CV|physics.data-an|stat.ML"} {"id":"1803.10836v1","submitted":"2018-03-28 20:13:18","updated":"2018-03-28 20:13:18","title":"Technical Report: On the Usability of Hadoop MapReduce, Apache Spark &\n Apache Flink for Data Science","abstract":" Distributed data processing platforms for cloud computing are important tools\nfor large-scale data analytics. Apache Hadoop MapReduce has become the de facto\nstandard in this space, though its programming interface is relatively\nlow-level, requiring many implementation steps even for simple analysis tasks.\nThis has led to the development of advanced dataflow oriented platforms, most\nprominently Apache Spark and Apache Flink. Those platforms not only aim to\nimprove performance through improved in-memory processing, but in particular\nprovide built-in high-level data processing functionality, such as filtering\nand join operators, which should make data analysis tasks easier to develop\nthan with plain Hadoop MapReduce. But is this indeed the case?\n This paper compares three prominent distributed data processing platforms:\nApache Hadoop MapReduce; Apache Spark; and Apache Flink, from a usability\nperspective. We report on the design, execution and results of a usability\nstudy with a cohort of masters students, who were learning and working with all\nthree platforms in order to solve different use cases set in a data science\ncontext. Our findings show that Spark and Flink are preferred platforms over\nMapReduce. Among participants, there was no significant difference in perceived\npreference or development time between both Spark and Flink as platforms for\nbatch-oriented big data analysis. This study starts an exploration of the\nfactors that make big data platforms more - or less - effective for users in\ndata science.\n","authors":"Bilal Akil|Ying Zhou|Uwe Röhm","affiliations":"","link_abstract":"http://arxiv.org/abs/1803.10836v1","link_pdf":"http://arxiv.org/pdf/1803.10836v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC"} {"id":"1804.00180v1","submitted":"2018-03-31 14:51:08","updated":"2018-03-31 14:51:08","title":"Efficient Sparse Code Multiple Access Decoder Based on Deterministic\n Message Passing Algorithm","abstract":" Being an effective non-orthogonal multiple access (NOMA) technique, sparse\ncode multiple access (SCMA) is promising for future wireless communication.\nCompared with orthogonal techniques, SCMA enjoys higher overloading tolerance\nand lower complexity because of its sparsity. In this paper, based on\ndeterministic message passing algorithm (DMPA), algorithmic simplifications\nsuch as domain changing and probability approximation are applied for SCMA\ndecoding. Early termination, adaptive decoding, and initial noise reduction are\nalso employed for faster convergence and better performance. 
Numerical results\nshow that the proposed optimizations benefit both decoding complexity and\nspeed. Furthermore, efficient hardware architectures based on folding and\nretiming are proposed. VLSI implementation is also given in this paper.\nComparison with the state-of-the-art has shown the proposed decoder's\nadvantages in both latency and throughput (multi-Gbps).\n","authors":"Chuan Zhang|Chao Yang|Wei Xu|Shunqing Zhang|Zaichen Zhang|Xiaohu You","affiliations":"Lab of Efficient Architectures for Digital-communication and Signal-processing|Lab of Efficient Architectures for Digital-communication and Signal-processing|National Mobile Communications Research Laboratory|Shanghai Institute for Advanced Communications and Data Science, Shanghai University, Shanghai, China|Quantum Information Center, Southeast University, China|National Mobile Communications Research Laboratory","link_abstract":"http://arxiv.org/abs/1804.00180v1","link_pdf":"http://arxiv.org/pdf/1804.00180v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.IT","categories":"cs.IT|cs.AR|eess.SP|math.IT"} {"id":"1804.03565v1","submitted":"2018-04-03 19:44:57","updated":"2018-04-03 19:44:57","title":"Predicting Gross Movie Revenue","abstract":" 'There is no terror in the bang, only in the anticipation of it' - Alfred\nHitchcock.\n Yet there is everything in correctly anticipating the bang a movie would make\nin the box-office. Movies are a high-profile, billion-dollar industry, and\nprediction of movie revenue can be very lucrative. Predicted revenues can be\nused for planning both the production and distribution stages. For example,\nprojected gross revenue can be used to plan the remuneration of the actors and\ncrew members as well as other parts of the budget [1].\n Success or failure of a movie can depend on many factors: star-power, release\ndate, budget, MPAA (Motion Picture Association of America) rating, plot and the\nhighly unpredictable human reactions. The enormous number of exogenous\nvariables makes the manual revenue prediction process extremely difficult.\nHowever, in the era of computer and data sciences, volumes of data can be\nefficiently processed and modelled. 
Hence the tough job of predicting gross revenue of a\nmovie can be simplified with the help of modern computing power and the\nhistorical data available as movie databases [2].\n","authors":"Sharmistha Dey","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.03565v1","link_pdf":"http://arxiv.org/pdf/1804.03565v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|stat.ML"} {"id":"1804.01901v2","submitted":"2018-04-05 15:12:33","updated":"2019-04-11 21:38:42","title":"Towards radiologist-level cancer risk assessment in CT lung screening\n using deep learning","abstract":" Importance: Lung cancer is the leading cause of cancer mortality in the US,\nresponsible for more deaths than breast, prostate, colon and pancreas cancer\ncombined and it has been recently demonstrated that low-dose computed\ntomography (CT) screening of the chest can significantly reduce this death\nrate.\n Objective: To compare the performance of a deep learning model to\nstate-of-the-art automated algorithms and radiologists as well as assessing the\nrobustness of the algorithm in heterogeneous datasets.\n Design, Setting, and Participants: Three low-dose CT lung cancer screening\ndatasets from heterogeneous sources were used, including National Lung\nScreening Trial (NLST, n=3410), Lahey Hospital and Medical Center (LHMC,\nn=3174) data, Kaggle competition data (from both stages, n=1595+505) and the\nUniversity of Chicago data (UCM, a subset of NLST, annotated by radiologists,\nn=197). Relevant works on automated methods for Lung Cancer malignancy\nestimation have used significantly less data in size and diversity. At the\nfirst stage, our framework employs a nodule detector; while in the second\nstage, we use both the image area around the nodules and nodule features as\ninputs to a neural network that estimates the malignancy risk for the entire CT\nscan. We trained our two-stage algorithm on a part of the NLST dataset, and\nvalidated it on the other datasets.\n Results, Conclusions, and Relevance: The proposed deep learning model: (a)\ngeneralizes well across all three data sets, achieving AUC between 86% to 94%;\n(b) has better performance than the widely accepted PanCan Risk Model,\nachieving 11% better AUC score; (c) has improved performance compared to the\nstate-of-the-art represented by the winners of the Kaggle Data Science Bowl\n2017 competition on lung cancer screening; (d) has comparable performance to\nradiologists in estimating cancer risk at a patient level.\n","authors":"Stojan Trajanovski|Dimitrios Mavroeidis|Christine Leon Swisher|Binyam Gebrekidan Gebre|Bastiaan S. Veeling|Rafael Wiemker|Tobias Klinder|Amir Tahmasebi|Shawn M. Regis|Christoph Wald|Brady J. McKee|Sebastian Flacke|Heber MacMahon|Homer Pien","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.01901v2","link_pdf":"http://arxiv.org/pdf/1804.01901v2","link_doi":"","comment":"Submitted for publication. 11 pages","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1804.01910v2","submitted":"2018-04-05 15:27:02","updated":"2018-08-21 09:59:59","title":"Multi-level Activation for Segmentation of Hierarchically-nested Classes","abstract":" For many biological image segmentation tasks, including topological\nknowledge, such as the nesting of classes, can greatly improve results.\nHowever, most `out-of-the-box' CNN models are still blind to such prior\ninformation. 
In this paper, we propose a novel approach to encode this\ninformation, through a multi-level activation layer and three compatible\nlosses. We benchmark all of them on nuclei segmentation in bright-field\nmicroscopy cell images from the 2018 Data Science Bowl challenge, offering an\nexemplary segmentation task with cells and nested subcellular structures. Our\nscheme greatly speeds up learning, and outperforms standard multi-class\nclassification with soft-max activation and a previously proposed method\nstemming from it, improving the Dice score significantly (p-values<0.007). Our\napproach is conceptually simple, easy to implement and can be integrated in any\nCNN architecture. It can be generalized to a higher number of classes, with or\nwithout further relations of containment.\n","authors":"Marie Piraud|Anjany Sekuboyina|Bjoern H. Menze","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.01910v2","link_pdf":"http://arxiv.org/pdf/1804.01910v2","link_doi":"","comment":"Accepted for the BioImage Computing 2018 workshop, ECCV conference","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1804.02663v1","submitted":"2018-04-08 09:50:52","updated":"2018-04-08 09:50:52","title":"Method of fractal diversity in data science problems","abstract":" The parameter (SNR) is obtained for distinguishing the Gaussian function, the\ndistribution of random variables in the absence of cross correlation, from\nother functions, which makes it possible to describe collective states with\nstrong cross-correlation of data. The signal-to-noise ratio (SNR) in\none-dimensional space is determined and a calculation algorithm based on the\nfractal variety of the Cantor dust in a closed loop is given. The algorithm is\ninvariant for linear transformations of the initial data set, has\nrenormalization-group invariance, and determines the intensity of\ncross-correlation (collective effect) of the data. The description of the\ncollective state is universal and does not depend on the nature of the\ncorrelation of data, nor is the universality of the distribution of random\nvariables in the absence of data correlation. The method is applicable for\nlarge sets of non-Gaussian or strange data obtained in information technology.\nIn confirming the hypothesis of Koshland, the application of the method to the\nintensity data of digital X-ray diffraction spectra with the calculation of the\ncollective effect makes it possible to identify a conformer exhibiting\nbiological activity.\n","authors":"Vitalii Vladimirov|Elena Vladimirova","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.02663v1","link_pdf":"http://arxiv.org/pdf/1804.02663v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"physics.data-an","categories":"physics.data-an"} {"id":"1804.02998v1","submitted":"2018-04-09 14:09:47","updated":"2018-04-09 14:09:47","title":"Anomaly Detection for Industrial Big Data","abstract":" As the Industrial Internet of Things (IIoT) grows, systems are increasingly\nbeing monitored by arrays of sensors returning time-series data at\never-increasing 'volume, velocity and variety' (i.e. Industrial Big Data). An\nobvious use for these data is real-time systems condition monitoring and\nprognostic time to failure analysis (remaining useful life, RUL). (e.g. See\nwhite papers by Senseye.io, and output of the NASA Prognostics Center of\nExcellence (PCoE).) 
However, as noted by Agrawal and Choudhary 'Our ability to\ncollect \"big data\" has greatly surpassed our capability to analyze it,\nunderscoring the emergence of the fourth paradigm of science, which is\ndata-driven discovery.' In order to fully utilize the potential of Industrial\nBig Data we need data-driven techniques that operate at scales that process\nmodels cannot. Here we present a prototype technique for data-driven anomaly\ndetection to operate at industrial scale. The method generalizes to application\nwith almost any multivariate dataset based on independent ordinations of\nrepeated (bootstrapped) partitions of the dataset and inspection of the joint\ndistribution of ordinal distances.\n","authors":"Neil Caithness|David Wallom","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.02998v1","link_pdf":"http://arxiv.org/pdf/1804.02998v1","link_doi":"http://dx.doi.org/10.5220/0006835502850293","comment":"9 pages; 11 figures","journal_ref":"In Proceedings of the 7th International Conference on Data\n Science, Technology and Applications - Volume 1: DATA (2018), ISBN\n 978-989-758-318-6, pages 285-293","doi":"10.5220/0006835502850293","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1804.03184v2","submitted":"2018-04-09 18:59:05","updated":"2018-06-07 21:04:17","title":"Adversarial Time-to-Event Modeling","abstract":" Modern health data science applications leverage abundant molecular and\nelectronic health data, providing opportunities for machine learning to build\nstatistical models to support clinical practice. Time-to-event analysis, also\ncalled survival analysis, stands as one of the most representative examples of\nsuch statistical models. We present a deep-network-based approach that\nleverages adversarial learning to address a key challenge in modern\ntime-to-event modeling: nonparametric estimation of event-time distributions.\nWe also introduce a principled cost function to exploit information from\ncensored events (events that occur subsequent to the observation window).\nUnlike most time-to-event models, we focus on the estimation of time-to-event\ndistributions, rather than time ordering. We validate our model on both\nbenchmark and real datasets, demonstrating that the proposed formulation yields\nsignificant performance gains relative to a parametric alternative, which we\nalso propose.\n","authors":"Paidamoyo Chapfuwa|Chenyang Tao|Chunyuan Li|Courtney Page|Benjamin Goldstein|Lawrence Carin|Ricardo Henao","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.03184v2","link_pdf":"http://arxiv.org/pdf/1804.03184v2","link_doi":"","comment":"Published in ICML 2018; Code:\n https://github.com/paidamoyo/adversarial_time_to_event","journal_ref":"Proceedings of the 35th International Conference on Machine\n Learning, PMLR 80:735-744, 2018","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1804.04457v2","submitted":"2018-04-12 12:13:34","updated":"2018-09-20 12:07:44","title":"Goal-based sensitivity maps using time windows and ensemble\n perturbations","abstract":" We present an approach for forming sensitivity maps (or sensitivites) using\nensembles. The method is an alternative to using an adjoint, which can be very\nchallenging to formulate and also computationally expensive to solve. 
The main\nnovelties of the presented approach are: 1) the use of goals, weighting the\nperturbation to help resolve the most important sensitivities, 2) the use of\ntime windows, which enable the perturbations to be optimised independently for\neach window and 3) re-orthogonalisation of the solution through time, which\nhelps optimise each perturbation when calculating sensitivity maps. These novel\nmethods greatly reduce the number of ensembles required to form the sensitivity\nmaps as demonstrated in this paper. As the presented method relies solely on\nensembles obtained from the forward model, it can therefore be applied directly\nto forward models of arbitrary complexity arising from, for example,\nmulti-physics coupling, legacy codes or model chains. It can also be applied to\ncompute sensitivities for optimisation of sensor placement, optimisation for\ndesign or control, goal-based mesh adaptivity, assessment of goals (e.g. hazard\nassessment and mitigation in the natural environment), determining the worth of\ncurrent data and data assimilation.\n We analyse and demonstrate the efficiency of the approach by applying the\nmethod to advection problems and also a non-linear heterogeneous multi-phase\nporous media problem, showing, in all cases, that the number of ensembles\nrequired to obtain accurate sensitivity maps is relatively low, in the order of\n10s.\n","authors":"C. E. Heaney|P. Salinas|F. Fang|C. C. Pain|I. M. Navon","affiliations":"Applied Modelling and Computation Group, Department of Earth Science and Engineering, Imperial College London, UK|Applied Modelling and Computation Group, Department of Earth Science and Engineering, Imperial College London, UK|Applied Modelling and Computation Group, Department of Earth Science and Engineering, Imperial College London, UK|Applied Modelling and Computation Group, Department of Earth Science and Engineering, Imperial College London, UK|Department of Scientific Computing, Florida State University, USA","link_abstract":"http://arxiv.org/abs/1804.04457v2","link_pdf":"http://arxiv.org/pdf/1804.04457v2","link_doi":"","comment":"35 pages, 13 figures. Submitted to JCP in September 2018 Changes:\n additional context given in the introduction, additional explanation given in\n section 2.2, some changes to equations. Results unchanged","journal_ref":"","doi":"","primary_category":"cs.CE","categories":"cs.CE|physics.comp-ph"} {"id":"1804.04791v1","submitted":"2018-04-13 05:35:19","updated":"2018-04-13 05:35:19","title":"Fast, Parameter free Outlier Identification for Robust PCA","abstract":" Robust PCA, the problem of PCA in the presence of outliers has been\nextensively investigated in the last few years. Here we focus on Robust PCA in\nthe column sparse outlier model. The existing methods for column sparse outlier\nmodel assumes either the knowledge of the dimension of the lower dimensional\nsubspace or the fraction of outliers in the system. However in many\napplications knowledge of these parameters is not available. Motivated by this\nwe propose a parameter free outlier identification method for robust PCA which\na) does not require the knowledge of outlier fraction, b) does not require the\nknowledge of the dimension of the underlying subspace, c) is computationally\nsimple and fast. 
Further, analytical guarantees are derived for outlier\nidentification and the performance of the algorithm is compared with the\nexisting state of the art methods.\n","authors":"Vishnu Menon|Sheetal Kalyani","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.04791v1","link_pdf":"http://arxiv.org/pdf/1804.04791v1","link_doi":"","comment":"13 pages. Submitted to IEEE JSTSP Special Issue on Data Science:\n Robust Subspace Learning and Tracking: Theory, Algorithms, and Applications","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1804.05464v3","submitted":"2018-04-16 01:14:17","updated":"2020-02-20 18:26:35","title":"On Gradient-Based Learning in Continuous Games","abstract":" We formulate a general framework for competitive gradient-based learning that\nencompasses a wide breadth of multi-agent learning algorithms, and analyze the\nlimiting behavior of competitive gradient-based learning algorithms using\ndynamical systems theory. For both general-sum and potential games, we\ncharacterize a non-negligible subset of the local Nash equilibria that will be\navoided if each agent employs a gradient-based learning algorithm. We also shed\nlight on the issue of convergence to non-Nash strategies in general- and\nzero-sum games, which may have no relevance to the underlying game, and arise\nsolely due to the choice of algorithm. The existence and frequency of such\nstrategies may explain some of the difficulties encountered when using gradient\ndescent in zero-sum games as, e.g., in the training of generative adversarial\nnetworks. To reinforce the theoretical contributions, we provide empirical\nresults that highlight the frequency of linear quadratic dynamic games (a\nbenchmark for multi-agent reinforcement learning) that admit global Nash\nequilibria that are almost surely avoided by policy gradient.\n","authors":"Eric Mazumdar|Lillian J. Ratliff|S. Shankar Sastry","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.05464v3","link_pdf":"http://arxiv.org/pdf/1804.05464v3","link_doi":"http://dx.doi.org/10.1137/18M1231298","comment":"","journal_ref":"SIAM Journal on Mathematics of Data Science 2020 2:1, 103-131","doi":"10.1137/18M1231298","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1804.07481v1","submitted":"2018-04-20 08:03:52","updated":"2018-04-20 08:03:52","title":"Streaming Active Learning Strategies for Real-Life Credit Card Fraud\n Detection: Assessment and Visualization","abstract":" Credit card fraud detection is a very challenging problem because of the\nspecific nature of transaction data and the labeling process. The transaction\ndata is peculiar because they are obtained in a streaming fashion, they are\nstrongly imbalanced and prone to non-stationarity. The labeling is the outcome\nof an active learning process, as every day human investigators contact only a\nsmall number of cardholders (associated to the riskiest transactions) and\nobtain the class (fraud or genuine) of the related transactions. An adequate\nselection of the set of cardholders is therefore crucial for an efficient fraud\ndetection process. In this paper, we present a number of active learning\nstrategies and we investigate their fraud detection accuracies. We compare\ndifferent criteria (supervised, semi-supervised and unsupervised) to query\nunlabeled transactions. 
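For readers unfamiliar with active learning in this setting, the sketch below shows one generic query criterion (uncertainty sampling under a daily labelling budget). It is a hedged illustration only; the paper compares several supervised, semi-supervised and unsupervised criteria, and the classifier, synthetic data and budget used here are invented for the demo.

```python
# A hedged illustration of one generic query criterion (uncertainty sampling),
# not the paper's full comparison: given a classifier trained on the labelled
# transactions, ask investigators to label the pool points it is least sure about.
# All data, names and the daily budget below are invented for the demo.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def uncertainty_query(model, X_pool, budget):
    """Indices of the `budget` pool rows whose fraud probability is closest to 0.5."""
    p_fraud = model.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(p_fraud - 0.5)         # larger = closer to the decision boundary
    return np.argsort(uncertainty)[-budget:]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X_lab = rng.normal(size=(1000, 8))
    y_lab = (X_lab[:, 0] + 0.5 * rng.normal(size=1000) > 1.5).astype(int)  # rare "fraud" class
    X_pool = rng.normal(size=(5000, 8))          # unlabelled stream of transactions

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_lab, y_lab)
    to_label = uncertainty_query(clf, X_pool, budget=50)
    print(len(to_label), "transactions selected for investigation")
```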
Finally, we highlight the existence of an\nexploitation/exploration trade-off for active learning in the context of fraud\ndetection, which has so far been overlooked in the literature.\n","authors":"Fabirzio Carcillo|Yann-Aël Le Borgne|Olivier Caelen|Gianluca Bontempi","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.07481v1","link_pdf":"http://arxiv.org/pdf/1804.07481v1","link_doi":"http://dx.doi.org/10.1007/s41060-018-0116-z","comment":"","journal_ref":"International Journal of Data Science and Analytics 2018","doi":"10.1007/s41060-018-0116-z","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1804.07795v3","submitted":"2018-04-20 18:52:52","updated":"2018-05-26 00:29:34","title":"Stochastic subgradient method converges on tame functions","abstract":" This work considers the question: what convergence guarantees does the\nstochastic subgradient method have in the absence of smoothness and convexity?\nWe prove that the stochastic subgradient method, on any semialgebraic locally\nLipschitz function, produces limit points that are all first-order stationary.\nMore generally, our result applies to any function with a Whitney stratifiable\ngraph. In particular, this work endows the stochastic subgradient method, and\nits proximal extension, with rigorous convergence guarantees for a wide class\nof problems arising in data science---including all popular deep learning\narchitectures.\n","authors":"Damek Davis|Dmitriy Drusvyatskiy|Sham Kakade|Jason D. Lee","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.07795v3","link_pdf":"http://arxiv.org/pdf/1804.07795v3","link_doi":"","comment":"32 pages, 1 figure","journal_ref":"","doi":"","primary_category":"math.OC","categories":"math.OC|cs.LG|65K05, 65K10, 90C15, 90C30"} {"id":"1804.08133v2","submitted":"2018-04-22 16:33:56","updated":"2018-04-24 12:59:51","title":"SolidWorx: A Resilient and Trustworthy Transactive Platform for Smart\n and Connected Communities","abstract":" Internet of Things and data sciences are fueling the development of\ninnovative solutions for various applications in Smart and Connected\nCommunities (SCC). These applications provide participants with the capability\nto exchange not only data but also resources, which raises the concerns of\nintegrity, trust, and above all the need for fair and optimal solutions to the\nproblem of resource allocation. This exchange of information and resources\nleads to a problem where the stakeholders of the system may have limited trust\nin each other. Thus, collaboratively reaching consensus on when, how, and who\nshould access certain resources becomes problematic. This paper presents\nSolidWorx, a blockchain-based platform that provides key mechanisms required\nfor arbitrating resource consumption across different SCC applications in a\ndomain-agnostic manner. For example, it introduces and implements a\nhybrid-solver pattern, where complex optimization computation is handled\noff-blockchain while solution validation is performed by a smart contract. 
To\nensure correctness, the smart contract of SolidWorx is generated and verified.\n","authors":"Scott Eisele|Aron Laszka|Anastasia Mavridou|Abhishek Dubey","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.08133v2","link_pdf":"http://arxiv.org/pdf/1804.08133v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC"} {"id":"1804.08170v1","submitted":"2018-04-22 21:00:28","updated":"2018-04-22 21:00:28","title":"A Deep Convolutional Neural Network for Lung Cancer Diagnostic","abstract":" In this paper, we examine the strength of deep learning techniques for\ndiagnosing lung cancer on a medical image analysis problem. Convolutional neural\nnetwork (CNN) models have become popular in the pattern recognition and\ncomputer vision research community because of their promising outcomes in generating\nhigh-level image representations. We propose a new deep learning architecture\nfor learning high-level image representation to achieve high classification\naccuracy with low variance in medical image binary classification tasks. We aim\nto learn discriminant compact features at the beginning of our deep convolutional\nneural network. We evaluate our model on the Kaggle Data Science Bowl 2017 (KDSB17)\ndata set, and compare it with some related works proposed in the Kaggle\ncompetition.\n","authors":"Mehdi Fatan Serj|Bahram Lavi|Gabriela Hoff|Domenec Puig Valls","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.08170v1","link_pdf":"http://arxiv.org/pdf/1804.08170v1","link_doi":"","comment":"10 pages, 5 figures, 2 tables","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1804.08685v3","submitted":"2018-04-23 19:59:51","updated":"2018-09-07 15:38:04","title":"Crawling in Rogue's dungeons with (partitioned) A3C","abstract":" Rogue is a famous dungeon-crawling video game of the 1980s, the ancestor of\nits genre. Rogue-like games are known for the necessity to explore partially\nobservable and always different randomly-generated labyrinths, preventing any\nform of level replay. As such, they serve as a very natural and challenging\ntask for reinforcement learning, requiring the acquisition of complex,\nnon-reactive behaviors involving memory and planning. In this article we show\nhow, exploiting a version of A3C partitioned on different situations, the agent\nis able to reach the stairs and descend to the next level in 98% of cases.\n","authors":"Andrea Asperti|Daniele Cortesi|Francesco Sovrano","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.08685v3","link_pdf":"http://arxiv.org/pdf/1804.08685v3","link_doi":"http://dx.doi.org/10.1007/978-3-030-13709-0_22","comment":"Accepted at the Fourth International Conference on Machine Learning,\n Optimization, and Data Science (LOD 2018)","journal_ref":"","doi":"10.1007/978-3-030-13709-0_22","primary_category":"cs.LG","categories":"cs.LG|stat.ML|I.2.6"} {"id":"1804.08939v1","submitted":"2018-04-24 10:07:02","updated":"2018-04-24 10:07:02","title":"Building a scalable python distribution for HEP data analysis","abstract":" There are numerous approaches to building analysis applications across the\nhigh-energy physics community. Among them are Python-based, or at least\nPython-driven, analysis workflows. We aim to ease the adoption of a\nPython-based analysis toolkit by making it easier for non-expert users to gain\naccess to Python tools for scientific analysis. 
Experimental software\ndistributions and individual user analysis have quite different requirements.\nDistributions tend to worry most about stability, usability and\nreproducibility, while the users usually strive to be fast and nimble. We\ndiscuss how we built and now maintain a python distribution for analysis while\nsatisfying requirements both a large software distribution (in our case, that\nof CMSSW) and user, or laptop, level analysis. We pursued the integration of\ntools used by the broader data science community as well as HEP developed\n(e.g., histogrammar, root_numpy) Python packages. We discuss concepts we\ninvestigated for package integration and testing, as well as issues we\nencountered through this process. Distribution and platform support are\nimportant topics. We discuss our approach and progress towards a sustainable\ninfrastructure for supporting this Python stack for the CMS user community and\nfor the broader HEP user community.\n","authors":"David Lange","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.08939v1","link_pdf":"http://arxiv.org/pdf/1804.08939v1","link_doi":"","comment":"Proceedings of 18th International Workshop on Advanced Computing and\n Analysis Techniques in Physics Research (ACAT)","journal_ref":"","doi":"","primary_category":"physics.comp-ph","categories":"physics.comp-ph"} {"id":"1804.08980v1","submitted":"2018-04-24 12:17:01","updated":"2018-04-24 12:17:01","title":"Rate-Distortion Theory for General Sets and Measures","abstract":" This paper is concerned with a rate-distortion theory for sequences of i.i.d.\nrandom variables with general distribution supported on general sets including\nmanifolds and fractal sets. Manifold structures are prevalent in data science,\ne.g., in compressed sensing, machine learning, image processing, and\nhandwritten digit recognition. Fractal sets find application in image\ncompression and in modeling of Ethernet traffic. We derive a lower bound on the\n(single-letter) rate-distortion function that applies to random variables X of\ngeneral distribution and for continuous X reduces to the classical Shannon\nlower bound. Moreover, our lower bound is explicit up to a parameter obtained\nby solving a convex optimization problem in a nonnegative real variable. The\nonly requirement for the bound to apply is the existence of a sigma-finite\nreference measure for X satisfying a certain subregularity condition. This\ncondition is very general and prevents the reference measure from being highly\nconcentrated on balls of small radii. To illustrate the wide applicability of\nour result, we evaluate the lower bound for a random variable distributed\nuniformly on a manifold, namely, the unit circle, and a random variable\ndistributed uniformly on a self-similar set, namely, the middle third Cantor\nset.\n","authors":"Erwin Riegler|Günther Koliander|Helmut Bölcskei","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.08980v1","link_pdf":"http://arxiv.org/pdf/1804.08980v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.IT","categories":"cs.IT|math.IT"} {"id":"1804.10846v6","submitted":"2018-04-28 20:23:45","updated":"2019-04-07 03:10:51","title":"Data science is science's second chance to get causal inference right: A\n classification of data science tasks","abstract":" Causal inference from observational data is the goal of many data analyses in\nthe health and social sciences. However, academic statistics has often frowned\nupon data analyses with a causal objective. 
The introduction of the term \"data\nscience\" provides a historic opportunity to redefine data analysis in such a\nway that it naturally accommodates causal inference from observational data.\nLike others before, we organize the scientific contributions of data science\ninto three classes of tasks: Description, prediction, and counterfactual\nprediction (which includes causal inference). An explicit classification of\ndata science tasks is necessary to discuss the data, assumptions, and analytics\nrequired to successfully accomplish each task. We argue that a failure to\nadequately describe the role of subject-matter expert knowledge in data\nanalysis is a source of widespread misunderstandings about data science.\nSpecifically, causal analyses typically require not only good data and\nalgorithms, but also domain expert knowledge. We discuss the implications for\nthe use of data science to guide decision-making in the real world and to train\ndata scientists.\n","authors":"Miguel A. Hernán|John Hsu|Brian Healy","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.10846v6","link_pdf":"http://arxiv.org/pdf/1804.10846v6","link_doi":"http://dx.doi.org/10.1080/09332480.2019.1579578","comment":"","journal_ref":"Chance 32(1):42-49 (2019)","doi":"10.1080/09332480.2019.1579578","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1804.11174v2","submitted":"2018-04-30 13:18:08","updated":"2018-11-13 07:04:05","title":"Reducing Noise for PIC Simulations Using Kernel Density Estimation\n Algorithm","abstract":" Noise is a major concern for Particle-In-Cell (PIC) simulations. We propose a\nnew theoretical and algorithmic framework to evaluate and reduce the noise\nlevel for PIC simulations based on the Kernel Density Estimation (KDE) theory,\nwhich has been widely adopted in machine learning and big data science.\nAccording to this framework, the error on particle density estimation for PIC\nsimulations can be characterized by the Mean Integrated Square Error (MISE),\nwhich consists of two parts, systematic error and noise. A careful analysis\nshows that in the standard PIC methods noise is the dominant error, and the\nnoise level can be reduced if we select different shape functions that are\ncapable of balancing the systematic error and the noise. To improve\nperformance, we use the von Mises distribution as the shape function and seek\nan optimal particle width that minimizes the MISE, represented by a\nCross-Validation (CV) function. This procedure significantly reduces both the\nnoise and the MISE for PIC simulations. A particle-wise width adjustment\nalgorithm and a width update algorithm are further developed to reduce the\nMISE. Simulations using the examples of the Langmuir wave and Landau damping\ndemonstrate that the KDE algorithm developed in the present study reduces the\nnoise level on density estimation by 98%, and gives a much more accurate result\non the linear damping rate compared to the standard PIC methods. Meanwhile, it\nis computationally efficient and can save 40% of the computing time needed to achieve the same accuracy.\n","authors":"Wentao Wu|Hong Qin","affiliations":"","link_abstract":"http://arxiv.org/abs/1804.11174v2","link_pdf":"http://arxiv.org/pdf/1804.11174v2","link_doi":"http://dx.doi.org/10.1063/1.5038039","comment":"28 pages, 8 figures","journal_ref":"W. Wu, H. 
Qin, Physics of Plasmas 25, 102107 (2018)","doi":"10.1063/1.5038039","primary_category":"physics.comp-ph","categories":"physics.comp-ph|physics.plasm-ph"} {"id":"1805.00471v1","submitted":"2018-05-01 05:24:40","updated":"2018-05-01 05:24:40","title":"\"I ain't tellin' white folks nuthin\": A quantitative exploration of the\n race-related problem of candour in the WPA slave narratives","abstract":" From 1936-38, the Works Progress Administration interviewed thousands of\nformer slaves about their life experiences. While these interviews are crucial\nto understanding the \"peculiar institution\" from the standpoint of the slave\nhimself, issues relating to bias cloud analyses of these interviews. The\nproblem I investigate is the problem of candour in the WPA slave narratives: it\nis widely held in the historical community that the strict racial caste system\nof the Deep South compelled black ex-slaves to tell white interviewers what\nthey thought they wanted to hear, suggesting that there was a significant\ndifference candour depending on whether their interviewer was white or black.\nIn this work, I attempt to quantitatively characterise this race-related\nproblem of candour. Prior work has either been of an impressionistic,\nqualitative nature, or utilised exceedingly simple quantitative methodology. In\ncontrast, I use more sophisticated statistical methods: in particular word\nfrequency and sentiment analysis and comparative topic modelling with LDA to\ntry and identify differences in the content and sentiment expressed by\nex-slaves in front of white interviewers versus black interviewers. While my\nsentiment analysis methodology was ultimately unsuccessful due to the\ncomplexity of the task, my word frequency analysis and comparative topic\nmodelling methods both showed strong evidence that the content expressed in\nfront of white interviewers was different from that of black interviewers. In\nparticular, I found that the ex-slaves spoke much more about unfavourable\naspects of slavery like whipping and slave patrollers in front of interviewers\nof their own race. I hope that my more-sophisticated statistical methodology\nhelps improve the robustness of the argument for the existence of this problem\nof candour in the slave narratives, which some would seek to deny for\nrevisionist purposes.\n","authors":"Soumya Kambhampati","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.00471v1","link_pdf":"http://arxiv.org/pdf/1805.00471v1","link_doi":"","comment":"A thesis presented in partial fulfilment of the requirements of the\n degree of Bachelor of Arts in Statistics & Data Science at Yale University","journal_ref":"","doi":"","primary_category":"cs.CL","categories":"cs.CL"} {"id":"1805.05401v1","submitted":"2018-05-04 12:28:03","updated":"2018-05-04 12:28:03","title":"Building Data Science Capabilities into University Data Warehouse to\n Predict Graduation","abstract":" The discipline of data science emerged to combine statistical methods with\ncomputing. At Aalto University, Finland, we have taken first steps to bring\neducational data science as a part of daily operations of Management\nInformation Services. This required changes in IT environment: we enhanced data\nwarehouse infrastructure with a data science lab, where we can read predictive\nmodel training data from data warehouse database and use the created predictive\nmodels in database queries. 
We then conducted a data science pilot with an\nobjective to predict students' graduation probability and time-to-degree with\nstudent registry data. Further ethical and legal considerations are needed\nbefore using predictions in daily operations of the university.\n","authors":"Joonas Pesonen|Anna Fomkin|Lauri Jokipii","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.05401v1","link_pdf":"http://arxiv.org/pdf/1805.05401v1","link_doi":"","comment":"EUNIS 2018","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1805.08694v2","submitted":"2018-05-06 18:14:51","updated":"2018-07-17 21:05:33","title":"Image Based Fashion Product Recommendation with Deep Learning","abstract":" We develop a two-stage deep learning framework that recommends fashion images\nbased on other input images of similar style. For that purpose, a neural\nnetwork classifier is used as a data-driven, visually-aware feature extractor.\nThe latter then serves as input for similarity-based recommendations using a\nranking algorithm. Our approach is tested on the publicly available Fashion\ndataset. Initialization strategies using transfer learning from larger product\ndatabases are presented. Combined with more traditional content-based\nrecommendation systems, our framework can help to increase robustness and\nperformance, for example, by better matching a particular customer style.\n","authors":"Hessel Tuinhof|Clemens Pirker|Markus Haltmeier","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.08694v2","link_pdf":"http://arxiv.org/pdf/1805.08694v2","link_doi":"http://dx.doi.org/10.1007/978-3-030-13709-0_40","comment":"","journal_ref":"LOD: International Conference on Machine Learning, Optimization,\n and Data Science Machine Learning, Optimization, and Data Science 4th\n International Conference, LOD 2018, Volterra, Italy, September 13-16, 2018,\n Revised Selected Papers","doi":"10.1007/978-3-030-13709-0_40","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1805.03735v2","submitted":"2018-05-09 21:14:17","updated":"2018-05-14 14:37:54","title":"Sequence Aggregation Rules for Anomaly Detection in Computer Network\n Traffic","abstract":" We evaluate methods for applying unsupervised anomaly detection to\ncybersecurity applications on computer network traffic data, or flow. We borrow\nfrom the natural language processing literature and conceptualize flow as a\nsort of \"language\" spoken between machines. Five sequence aggregation rules are\nevaluated for their efficacy in flagging multiple attack types in a labeled\nflow dataset, CICIDS2017. For sequence modeling, we rely on long short-term\nmemory (LSTM) recurrent neural networks (RNN). Additionally, a simple\nfrequency-based model is described and its performance with respect to attack\ndetection is compared to the LSTM models. We conclude that the frequency-based\nmodel tends to perform as well as or better than the LSTM models for the tasks\nat hand, with a few notable exceptions.\n","authors":"Benjamin J. Radford|Bartley D. Richardson|Shawn E. 
Davis","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.03735v2","link_pdf":"http://arxiv.org/pdf/1805.03735v2","link_doi":"","comment":"Prepared for the American Statistical Associations Symposium on Data\n Science and Statistics 2018","journal_ref":"","doi":"","primary_category":"cs.CR","categories":"cs.CR|cs.CY|cs.LG|stat.AP"} {"id":"1805.05052v11","submitted":"2018-05-14 08:08:33","updated":"2019-05-19 17:55:22","title":"Machine Learning: Basic Principles","abstract":" This tutorial introduces some main concepts of machine learning (ML). From an\nengineering point of view, the field of ML revolves around developing software\nthat implements the scientific principle: (i) formulate a hypothesis (choose a\nmodel) about some phenomenon, (ii) collect data to test the hypothesis\n(validate the model) and (iii) refine the hypothesis (iterate). One important\nclass of algorithms based on this principle are gradient descent methods which\naim at iteratively refining a model which is parametrized by some (``weight'')\nvector. A plethora of ML methods is obtained by combining different choices for\nthe hypothesis space (model), the quality measure (loss) and the computational\nimplementation of the model refinement (optimization method). After formalizing the\nmain building blocks of an ML problem, some popular algorithmic design patterns\nfor ML methods are discussed. This tutorial grew out of the lecture notes\ndeveloped for the courses ``Machine Learning: Basic Principles'' and\n``Artificial Intelligence'', which I have co-taught since 2015 at Aalto\nUniversity.\n","authors":"Alexander Jung","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.05052v11","link_pdf":"http://arxiv.org/pdf/1805.05052v11","link_doi":"","comment":"Machine Learning, Artificial Intelligence, Deep Learning, Data\n Science","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1805.05502v5","submitted":"2018-05-15 00:24:43","updated":"2018-12-04 20:26:48","title":"Nonlinear Dimensionality Reduction for Discriminative Analytics of\n Multiple Datasets","abstract":" Principal component analysis (PCA) is widely used for feature extraction and\ndimensionality reduction, with documented merits in diverse tasks involving\nhigh-dimensional data. Standard PCA copes with one dataset at a time, but it is\nchallenged when it comes to analyzing multiple datasets jointly. In certain\ndata science settings however, one is often interested in extracting the most\ndiscriminative information from one dataset of particular interest (a.k.a.\ntarget data) relative to the other(s) (a.k.a. background data). To this end,\nthis paper puts forth a novel approach, termed discriminative (d) PCA, for such\ndiscriminative analytics of multiple datasets. Under certain conditions, dPCA\nis proved to be least-squares optimal in recovering the component vector unique\nto the target data relative to background data. To account for nonlinear data\ncorrelations, (linear) dPCA models for one or multiple background datasets are\ngeneralized through kernel-based learning. Interestingly, all dPCA variants\nadmit an analytical solution obtainable with a single (generalized) eigenvalue\ndecomposition. 
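The generalized eigenvalue route mentioned above can be illustrated with a small Python sketch. The exact dPCA criterion of the paper may differ; what follows is the common covariance-ratio formulation for contrasting a target dataset against a background dataset, with synthetic data and a regularization constant chosen only for the demo.

```python
# A hedged illustration of the covariance-ratio formulation (the exact dPCA
# criterion in the paper may differ): the directions along which a target
# dataset varies most relative to a background dataset are the top generalized
# eigenvectors of the pair of covariance matrices. Data and constants are invented.
import numpy as np
from scipy.linalg import eigh

def discriminative_directions(X_target, X_background, n_components=2, reg=1e-6):
    """Top generalized eigenvectors of (target covariance, background covariance)."""
    Ct = np.cov(X_target, rowvar=False)
    Cb = np.cov(X_background, rowvar=False) + reg * np.eye(X_target.shape[1])
    _, eigvecs = eigh(Ct, Cb)                    # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :n_components]    # directions with the largest ratio

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    background = rng.normal(size=(500, 10))
    target = background[:300].copy()
    target[:, 3] += rng.normal(scale=3.0, size=300)   # extra variance unique to the target
    W = discriminative_directions(target, background)
    print(np.round(np.abs(W[:, 0]), 2))               # weight concentrates on feature 3
```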
Finally, corroborating dimensionality reduction tests using both\nsynthetic and real datasets are provided to validate the effectiveness of the\nproposed methods.\n","authors":"Jia Chen|Gang Wang|Georgios B. Giannakis","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.05502v5","link_pdf":"http://arxiv.org/pdf/1805.05502v5","link_doi":"http://dx.doi.org/10.1109/TSP.2018.2885478","comment":"final version","journal_ref":"","doi":"10.1109/TSP.2018.2885478","primary_category":"cs.LG","categories":"cs.LG|eess.SP|stat.AP|stat.ML"} {"id":"1805.12069v1","submitted":"2018-05-16 22:08:28","updated":"2018-05-16 22:08:28","title":"Omega: An Architecture for AI Unification","abstract":" We introduce the open-ended, modular, self-improving Omega AI unification\narchitecture which is a refinement of Solomonoff's Alpha architecture, as\nconsidered from first principles. The architecture embodies several crucial\nprinciples of general intelligence including diversity of representations,\ndiversity of data types, integrated memory, modularity, and higher-order\ncognition. We retain the basic design of a fundamental algorithmic substrate\ncalled an \"AI kernel\" for problem solving and basic cognitive functions like\nmemory, and a larger, modular architecture that re-uses the kernel in many\nways. Omega includes eight representation languages and six classes of neural\nnetworks, which are briefly introduced. The architecture is intended to\ninitially address data science automation, hence it includes many problem\nsolving methods for statistical tasks. We review the broad software\narchitecture, higher-order cognition, self-improvement, modular neural\narchitectures, intelligent agents, the process and memory hierarchy, hardware\nabstraction, peer-to-peer computing, and data abstraction facility.\n","authors":"Eray Özkural","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.12069v1","link_pdf":"http://arxiv.org/pdf/1805.12069v1","link_doi":"","comment":"This is a high-level overview of the Omega AGI architecture which is\n the basis of a data science automation system. Submitted to a workshop","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI"} {"id":"1805.06829v2","submitted":"2018-05-17 15:42:51","updated":"2019-03-15 07:24:13","title":"Multi-layered Network Structure: Relationship Between Financial and\n Macroeconomic Dynamics","abstract":" We demonstrate using multi-layered networks, the existence of an empirical\nlinkage between the dynamics of the financial network constructed from the\nmarket indices and the macroeconomic networks constructed from macroeconomic\nvariables such as trade, foreign direct investments, etc. for several countries\nacross the globe. The temporal scales of the dynamics of the financial\nvariables and the macroeconomic fundamentals are very different, which make the\nempirical linkage even more interesting and significant. Also, we find that\nthere exist in the respective networks, core-periphery structures (determined\nthrough centrality measures) that are composed of the similar set of countries\n-- a result that may be related through the `gravity model' of the\ncountry-level macroeconomic networks. Thus, from a multi-lateral openness\nperspective, we elucidate that for individual countries, larger trade\nconnectivity is positively associated with higher financial return\ncorrelations. 
Furthermore, we show that the Economic Complexity Index and the\nequity markets have a positive relationship among themselves, as is the case\nfor Gross Domestic Product. The data science methodology using network theory,\ncoupled with standard econometric techniques, constitutes a new approach to\nstudying multi-level economic phenomena in a comprehensive manner.\n","authors":"Kiran Sharma|Anindya S. Chakrabarti|Anirban Chakraborti","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.06829v2","link_pdf":"http://arxiv.org/pdf/1805.06829v2","link_doi":"","comment":"43 pages, 7 figures, 31 tables","journal_ref":"","doi":"","primary_category":"econ.GN","categories":"econ.GN|q-fin.EC|q-fin.GN"} {"id":"1805.07325v2","submitted":"2018-05-18 16:58:44","updated":"2018-07-19 20:31:28","title":"Machine learning with force-field inspired descriptors for materials:\n fast screening and mapping energy landscape","abstract":" We present a complete set of chemo-structural descriptors to significantly\nextend the applicability of machine-learning (ML) in material screening and\nmapping energy landscape for multicomponent systems. These new descriptors\nallow differentiating between structural prototypes, which is not possible\nusing the commonly used chemical-only descriptors. Specifically, we demonstrate\nthat the combination of pairwise radial, nearest neighbor, bond-angle,\ndihedral-angle and core-charge distributions plays an important role in\npredicting formation energies, bandgaps, static refractive indices, magnetic\nproperties, and modulus of elasticity for three-dimensional (3D) materials as\nwell as exfoliation energies of two-dimensional (2D) layered materials. The\ntraining data consists of 24549 bulk and 616 monolayer materials taken from the\nJARVIS-DFT database. We obtained very accurate ML models using a gradient\nboosting algorithm. Then we use the trained models to discover exfoliable\n2D-layered materials satisfying specific property requirements. Additionally,\nwe integrate our formation energy ML model with a genetic algorithm for\nstructure search to verify if the ML model reproduces the DFT convex hull. This\nverification establishes a more stringent evaluation metric for the ML model\nthan what is commonly used in data science. Our learnt model is publicly\navailable on the JARVIS-ML website (https://www.ctcms.nist.gov/jarvisml) for\nproperty predictions of generalized materials.\n","authors":"Kamal Choudhary|Brian DeCost|Francesca Tavazza","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.07325v2","link_pdf":"http://arxiv.org/pdf/1805.07325v2","link_doi":"http://dx.doi.org/10.1103/PhysRevMaterials.2.083801","comment":"","journal_ref":"Phys. Rev. Materials 2, 083801 (2018)","doi":"10.1103/PhysRevMaterials.2.083801","primary_category":"cond-mat.mtrl-sci","categories":"cond-mat.mtrl-sci"} {"id":"1805.08575v2","submitted":"2018-05-20 15:03:09","updated":"2018-12-03 10:35:09","title":"Cost-Benefit Analysis of Data Intelligence -- Its Broader\n Interpretations","abstract":" The core of data science is our fundamental understanding about data\nintelligence processes for transforming data to decisions. One aspect of this\nunderstanding is how to analyze the cost-benefit of data intelligence\nworkflows. This work is built on the information-theoretic metric proposed by\nChen and Golan for this purpose and several recent studies and applications of\nthe metric. 
We present a set of extended interpretations of the metric by\nrelating the metric to encryption, compression, model development, perception,\ncognition, languages, and news media.\n","authors":"Min Chen","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.08575v2","link_pdf":"http://arxiv.org/pdf/1805.08575v2","link_doi":"","comment":"The first version was archived in May 2018. It was updated in\n December 2018 following a minor revision according to the reviewers' comments\n and suggestions","journal_ref":"","doi":"","primary_category":"cs.OH","categories":"cs.OH"} {"id":"1805.09320v2","submitted":"2018-05-23 08:20:29","updated":"2018-05-25 01:11:54","title":"A New Approach for 4DVar Data Assimilation","abstract":" Four-dimensional variational data assimilation (4DVar) has become an\nincreasingly important tool in data science with wide applications in many\nengineering and scientific fields such as geoscience, biology and the\nfinancial industry. The 4DVar seeks a solution that minimizes the departure\nfrom the background field and the mismatch between the forecast trajectory and\nthe observations within an assimilation window. The current state-of-the-art\n4DVar offers only two choices by using different forms of the forecast model:\nthe strong- and weak-constrained 4DVar approaches. The former ignores the\nmodel error and only corrects the initial condition error at the expense of\nreduced accuracy; while the latter accounts for both the initial and model\nerrors and corrects them separately, which increases computational costs and\nuncertainty. To overcome these limitations, here we develop an integral\ncorrecting 4DVar (i4DVar) approach by treating all errors as a whole and\ncorrecting them simultaneously and indiscriminately. To achieve that, a novel\nexponentially decaying function is proposed to characterize the error evolution\nand correct it at each time step in the i4DVar. As a result, the i4DVar greatly\nenhances the capability of the strong-constrained 4DVar for correcting the\nmodel error while also overcoming the limitation of the weak-constrained 4DVar\nof being prohibitively expensive with added uncertainty. Numerical experiments\nwith the Lorenz model show that the i4DVar significantly outperforms the\nexisting 4DVar approaches. It has the potential to be applied in many\nscientific and engineering fields and industrial sectors in the big data era\nbecause of its ease of implementation and superior performance.\n","authors":"Xiangjun Tian|Aiguo Dai|Xiaobing Feng|Hongqin Zhang|Rui Han|Lu Zhang","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.09320v2","link_pdf":"http://arxiv.org/pdf/1805.09320v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"physics.data-an","categories":"physics.data-an"} {"id":"1805.09676v2","submitted":"2018-05-24 13:54:24","updated":"2018-06-20 19:50:20","title":"Forming IDEAS Interactive Data Exploration & Analysis System","abstract":" Modern cyber security operations collect an enormous amount of logging and\nalerting data. While analysts have the ability to query and compute simple\nstatistics and plots from their data, current analytical tools are too simple\nto admit deep understanding. To detect advanced and novel attacks, analysts\nturn to manual investigations. While commonplace, current investigations are\ntime-consuming, intuition-based, and proving insufficient. 
Our hypothesis is\nthat arming the analyst with easy-to-use data science tools will increase their\nwork efficiency, provide them with the ability to resolve hypotheses with\nscientific inquiry of their data, and support their decisions with evidence\nover intuition. To this end, we present our work to build IDEAS (Interactive\nData Exploration and Analysis System). We present three real-world use-cases\nthat drive the system design from the algorithmic capabilities to the user\ninterface. Finally, a modular and scalable software architecture is discussed\nalong with plans for our pilot deployment with a security operation command.\n","authors":"Robert A. Bridges|Maria A. Vincent|Kelly M. T. Huffer|John R. Goodall|Jessie D. Jamieson|Zachary Burch","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.09676v2","link_pdf":"http://arxiv.org/pdf/1805.09676v2","link_doi":"","comment":"4 page short paper on IDEAS System, 4 figures","journal_ref":"Workshop on Information Security Workers, USENIX SOUPS 2018","doi":"","primary_category":"cs.CR","categories":"cs.CR|cs.AI|cs.HC"} {"id":"1805.09966v2","submitted":"2018-05-25 03:36:40","updated":"2018-10-22 19:18:57","title":"Prestige drives epistemic inequality in the diffusion of scientific\n ideas","abstract":" The spread of ideas in the scientific community is often viewed as a\ncompetition, in which good ideas spread further because of greater intrinsic\nfitness, and publication venue and citation counts correlate with importance\nand impact. However, relatively little is known about how structural factors\ninfluence the spread of ideas, and specifically how where an idea originates\nmight influence how it spreads. Here, we investigate the role of faculty hiring\nnetworks, which embody the set of researcher transitions from doctoral to\nfaculty institutions, in shaping the spread of ideas in computer science, and\nthe importance of where in the network an idea originates. We consider\ncomprehensive data on the hiring events of 5032 faculty at all 205\nPh.D.-granting departments of computer science in the U.S. and Canada, and on\nthe timing and titles of 200,476 associated publications. Analyzing five\npopular research topics, we show empirically that faculty hiring can and does\nfacilitate the spread of ideas in science. Having established such a mechanism,\nwe then analyze its potential consequences using epidemic models to simulate\nthe generic spread of research ideas and quantify the impact of where an idea\noriginates on its longterm diffusion across the network. We find that research\nfrom prestigious institutions spreads more quickly and completely than work of\nsimilar quality originating from less prestigious institutions. Our analyses\nestablish the theoretical trade-offs between university prestige and the\nquality of ideas necessary for efficient circulation. Our results establish\nfaculty hiring as an underlying mechanism that drives the persistent epistemic\nadvantage observed for elite institutions, and provide a theoretical lower\nbound for the impact of structural inequality in shaping the spread of ideas in\nscience.\n","authors":"Allison C. Morgan|Dimitrios J. Economou|Samuel F. 
Way|Aaron Clauset","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.09966v2","link_pdf":"http://arxiv.org/pdf/1805.09966v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-018-0166-4","comment":"10 pages, 8 figures, 1 table","journal_ref":"EPJ Data Science 7, 40 (2018)","doi":"10.1140/epjds/s13688-018-0166-4","primary_category":"cs.SI","categories":"cs.SI|cs.CY|physics.soc-ph"} {"id":"1805.10168v1","submitted":"2018-05-25 14:14:10","updated":"2018-05-25 14:14:10","title":"Futuristic Classification with Dynamic Reference Frame Strategy","abstract":" Classification is one of the widely used analytical techniques in data\nscience domain across different business to associate a pattern which\ncontribute to the occurrence of certain event which is predicted with some\nlikelihood. This Paper address a lacuna of creating some time window before the\nprediction actually happen to enable organizations some space to act on the\nprediction. There are some really good state of the art machine learning\ntechniques to optimally identify the possible churners in either customer base\nor employee base, similarly for fault prediction too if the prediction does not\ncome with some buffer time to act on the fault it is very difficult to provide\na seamless experience to the user. New concept of reference frame creation is\nintroduced to solve this problem in this paper\n","authors":"Kumarjit Pathak|Jitin Kapila|Aasheesh Barvey","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.10168v1","link_pdf":"http://arxiv.org/pdf/1805.10168v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1805.10940v1","submitted":"2018-05-25 15:19:51","updated":"2018-05-25 15:19:51","title":"Personalized Influence Estimation Technique","abstract":" Customer Satisfaction is the most important factors in the industry\nirrespective of domain. Key Driver Analysis is a common practice in data\nscience to help the business to evaluate the same. Understanding key features,\nwhich influence the outcome or dependent feature, is highly important in\nstatistical model building. This helps to eliminate not so important factors\nfrom the model to minimize noise coming from the features, which does not\ncontribute significantly enough to explain the behavior of the dependent\nfeature, which we want to predict. Personalized Influence Estimation is a\ntechnique introduced in this paper, which can estimate key factor influence for\nindividual observations, which contribute most for each observations behavior\npattern based on the dependent class or estimate. Observations can come from\nmultiple business problem i.e. customers related to satisfaction study,\ncustomer related to Fraud Detection, network devices for Fault detection etc.\nIt is highly important to understand the cause of issue at each observation\nlevel to take appropriate Individualized action at customer level or device\nlevel etc. This technique is based on joint behavior of the feature dimension\nfor the specific observation, and relative importance of the feature to\nestimate impact. The technique mentioned in this paper is aimed to help\norganizations to understand each respondents or observations individual key\ncontributing factor of Influence. 
Result of the experiment is really\nencouraging and able to justify key reasons for churn for majority of the\nsample appropriately\n","authors":"Kumarjit Pathak|Jitin Kapila|Aasheesh Barvey","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.10940v1","link_pdf":"http://arxiv.org/pdf/1805.10940v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1805.11012v1","submitted":"2018-05-28 16:12:18","updated":"2018-05-28 16:12:18","title":"To Bayes or Not To Bayes? That's no longer the question!","abstract":" This paper seeks to provide a thorough account of the ubiquitous nature of\nthe Bayesian paradigm in modern statistics, data science and artificial\nintelligence. Once maligned, on the one hand by those who philosophically hated\nthe very idea of subjective probability used in prior specification, and on the\nother hand because of the intractability of the computations needed for\nBayesian estimation and inference, the Bayesian school of thought now permeates\nand pervades virtually all areas of science, applied science, engineering,\nsocial science and even liberal arts, often in unsuspected ways. Thanks in part\nto the availability of powerful computing resources, but also to the literally\nunavoidable inherent presence of the quintessential building blocks of the\nBayesian paradigm in all walks of life, the Bayesian way of handling\nstatistical learning, estimation and inference is not only mainstream but also\nbecoming the most central approach to learning from the data. This paper\nexplores some of the most relevant elements to help to the reader appreciate\nthe pervading power and presence of the Bayesian paradigm in statistics,\nartificial intelligence and data science, with an emphasis on how the Gospel\naccording to Reverend Thomas Bayes has turned out to be the truly good news,\nand some cases the amazing saving grace, for all who seek to learn\nstatistically from the data. To further help the reader gain deeper and\ntangible practical insights into the Bayesian machinery, we point to some\ncomputational tools designed for the R Statistical Software Environment to help\nexplore Bayesian statistical learning.\n","authors":"Ernest Fokoue","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.11012v1","link_pdf":"http://arxiv.org/pdf/1805.11012v1","link_doi":"","comment":"14 pages, 4 figures","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT|62A01"} {"id":"1805.11557v4","submitted":"2018-05-29 16:05:29","updated":"2018-11-14 01:28:06","title":"Winning Models for GPA, Grit, and Layoff in the Fragile Families\n Challenge","abstract":" In this paper, we discuss and analyze our approach to the Fragile Families\nChallenge. The challenge involved predicting six outcomes for 4,242 children\nfrom disadvantaged families from around the United States. The data consisted\nof over 12,000 features (covariates) about the children and their parents,\nschools, and overall environments from birth to age 9. Our approach relied\nprimarily on existing data science techniques, including: (1) data\npreprocessing: elimination of low variance features, imputation of missing\ndata, and construction of composite features; (2) feature selection through\nunivariate Mutual Information and extraction of non-zero LASSO coefficients;\n(3) three machine learning models: Random Forest, Elastic Net, and\nGradient-Boosted Trees; and finally (4) prediction aggregation according to\nperformance. 
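A minimal Python sketch of this kind of pipeline (univariate mutual-information feature selection followed by three regressors) is given below. It is not the authors' code: a synthetic regression task stands in for the actual survey outcomes, the hyper-parameters are arbitrary, and the aggregation step is a plain average rather than the performance-based weighting described above.

```python
# A hedged illustration of this kind of pipeline, not the authors' code:
# univariate mutual-information feature selection, three regressors, and a
# plain (unweighted) average in place of the performance-based aggregation.
# The synthetic regression task stands in for the actual survey outcomes.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=200, n_informative=15,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 1) feature selection by univariate mutual information
selector = SelectKBest(mutual_info_regression, k=30).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# 2) three regressors trained on the selected features
models = [RandomForestRegressor(n_estimators=200, random_state=0),
          ElasticNet(alpha=0.1),
          GradientBoostingRegressor(random_state=0)]
preds = [m.fit(X_tr_sel, y_tr).predict(X_te_sel) for m in models]

# 3) naive aggregation: unweighted mean of the three predictions
ensemble = np.mean(preds, axis=0)
print("ensemble RMSE:", np.sqrt(np.mean((ensemble - y_te) ** 2)))
```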
The top-performing submissions produced winning out-of-sample\npredictions for three outcomes: GPA, grit, and layoff. However, predictions\nwere at most 20% better than a baseline that predicted the mean value of the\ntraining data of each outcome.\n","authors":"Daniel E Rigobon|Eaman Jahani|Yoshihiko Suhara|Khaled AlGhoneim|Abdulaziz Alghunaim|Alex Pentland|Abdullah Almaatouq","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.11557v4","link_pdf":"http://arxiv.org/pdf/1805.11557v4","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP"} {"id":"1805.11800v1","submitted":"2018-05-30 04:23:41","updated":"2018-05-30 04:23:41","title":"Accelerating Large-Scale Data Analysis by Offloading to High-Performance\n Computing Libraries using Alchemist","abstract":" Apache Spark is a popular system aimed at the analysis of large data sets,\nbut recent studies have shown that certain computations---in particular, many\nlinear algebra computations that are the basis for solving common machine\nlearning problems---are significantly slower in Spark than when done using\nlibraries written in a high-performance computing framework such as the\nMessage-Passing Interface (MPI).\n To remedy this, we introduce Alchemist, a system designed to call MPI-based\nlibraries from Apache Spark. Using Alchemist with Spark helps accelerate linear\nalgebra, machine learning, and related computations, while still retaining the\nbenefits of working within the Spark environment. We discuss the motivation\nbehind the development of Alchemist, and we provide a brief overview of its\ndesign and implementation.\n We also compare the performances of pure Spark implementations with those of\nSpark implementations that leverage MPI-based codes via Alchemist. To do so, we\nuse data science case studies: a large-scale application of the conjugate\ngradient method to solve very large linear systems arising in a speech\nclassification problem, where we see an improvement of an order of magnitude;\nand the truncated singular value decomposition (SVD) of a 400GB\nthree-dimensional ocean temperature data set, where we see a speedup of up to\n7.9x. We also illustrate that the truncated SVD computation is easily scalable\nto terabyte-sized data by applying it to data sets of sizes up to 17.6TB.\n","authors":"Alex Gittens|Kai Rothauge|Shusen Wang|Michael W. Mahoney|Lisa Gerhardt| Prabhat|Jey Kottalam|Michael Ringenburg|Kristyn Maschhoff","affiliations":"","link_abstract":"http://arxiv.org/abs/1805.11800v1","link_pdf":"http://arxiv.org/pdf/1805.11800v1","link_doi":"http://dx.doi.org/10.1145/3219819.3219927","comment":"Accepted for publication in Proceedings of the 24th ACM SIGKDD\n International Conference on Knowledge Discovery and Data Mining, London, UK,\n 2018","journal_ref":"","doi":"10.1145/3219819.3219927","primary_category":"cs.DC","categories":"cs.DC|cs.DB|physics.data-an|stat.CO"} {"id":"1806.00069v3","submitted":"2018-05-31 19:48:00","updated":"2019-02-03 21:06:50","title":"Explaining Explanations: An Overview of Interpretability of Machine\n Learning","abstract":" There has recently been a surge of work in explanatory artificial\nintelligence (XAI). This research area tackles the important problem that\ncomplex machines and algorithms often cannot provide insights into their\nbehavior and thought processes. XAI allows users and parts of the internal\nsystem to be more transparent, providing explanations of their decisions in\nsome level of detail. 
These explanations are important to ensure algorithmic\nfairness, identify potential bias/problems in the training data, and to ensure\nthat the algorithms perform as expected. However, explanations produced by\nthese systems is neither standardized nor systematically assessed. In an effort\nto create best practices and identify open challenges, we provide our\ndefinition of explainability and show how it can be used to classify existing\nliterature. We discuss why current approaches to explanatory methods especially\nfor deep neural networks are insufficient. Finally, based on our survey, we\nconclude with suggested future research directions for explanatory artificial\nintelligence.\n","authors":"Leilani H. Gilpin|David Bau|Ben Z. Yuan|Ayesha Bajwa|Michael Specter|Lalana Kagal","affiliations":"","link_abstract":"http://arxiv.org/abs/1806.00069v3","link_pdf":"http://arxiv.org/pdf/1806.00069v3","link_doi":"","comment":"The 5th IEEE International Conference on Data Science and Advanced\n Analytics (DSAA 2018). [Research Track]","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI|cs.LG|stat.ML"} {"id":"1806.00118v2","submitted":"2018-05-31 22:07:52","updated":"2018-08-12 23:12:45","title":"Statistical Problems with Planted Structures: Information-Theoretical\n and Computational Limits","abstract":" Over the past few years, insights from computer science, statistical physics,\nand information theory have revealed phase transitions in a wide array of\nhigh-dimensional statistical problems at two distinct thresholds: One is the\ninformation-theoretical (IT) threshold below which the observation is too noisy\nso that inference of the ground truth structure is impossible regardless of the\ncomputational cost; the other is the computational threshold above which\ninference can be performed efficiently, i.e., in time that is polynomial in the\ninput size. In the intermediate regime, inference is information-theoretically\npossible, but conjectured to be computationally hard.\n This article provides a survey of the common techniques for determining the\nsharp IT and computational limits, using community detection and submatrix\ndetection as illustrating examples. For IT limits, we discuss tools including\nthe first and second moment method for analyzing the maximum likelihood\nestimator, information-theoretic methods for proving impossibility results\nusing mutual information and rate-distortion theory, and methods originated\nfrom statistical physics such as interpolation method. To investigate\ncomputational limits, we describe a common recipe to construct a randomized\npolynomial-time reduction scheme that approximately maps instances of the\nplanted clique problem to the problem of interest in total variation distance.\n","authors":"Yihong Wu|Jiaming Xu","affiliations":"","link_abstract":"http://arxiv.org/abs/1806.00118v2","link_pdf":"http://arxiv.org/pdf/1806.00118v2","link_doi":"","comment":"Chapter in \"Information-Theoretic Methods in Data Science\". 
Edited by\n Yonina Eldar and Miguel Rodrigues, Cambridge University Press, forthcoming","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|cs.IT|math.IT|stat.TH"} {"id":"1806.02615v4","submitted":"2018-06-07 11:19:07","updated":"2020-06-04 11:30:00","title":"Explainable AI as a Social Microscope: A Case Study on Academic\n Performance","abstract":" Academic performance is perceived as a product of complex interactions\nbetween students' overall experience, personal characteristics and upbringing.\nData science techniques, most commonly involving regression analysis and\nrelated approaches, serve as a viable means to explore this interplay. However,\nthese tend to extract factors with wide-ranging impact, while overlooking\nvariations specific to individual students. Focusing on each student's\npeculiarities is generally impossible with thousands or even hundreds of\nsubjects, yet data mining methods might prove effective in devising more\ntargeted approaches. For instance, subjects with shared characteristics can be\nassigned to clusters, which can then be examined separately with machine\nlearning algorithms, thereby providing a more nuanced view of the factors\naffecting individuals in a particular group. In this context, we introduce a\ndata science workflow allowing for fine-grained analysis of academic\nperformance correlates that captures the subtle differences in students'\nsensitivities to these factors. Leveraging the Local Interpretable\nModel-Agnostic Explanations (LIME) algorithm from the toolbox of Explainable\nArtificial Intelligence (XAI) techniques, the proposed pipeline yields groups\nof students having similar academic attainment indicators, rather than similar\nfeatures (e.g. familial background) as typically practiced in prior studies. As\na proof-of-concept case study, a rich longitudinal dataset is selected to\nevaluate the effectiveness of the proposed approach versus a standard\nregression model.\n","authors":"Anahit Sargsyan|Areg Karapetyan|Wei Lee Woon|Aamena Alshamsi","affiliations":"","link_abstract":"http://arxiv.org/abs/1806.02615v4","link_pdf":"http://arxiv.org/pdf/1806.02615v4","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1806.03184v1","submitted":"2018-06-08 14:29:03","updated":"2018-06-08 14:29:03","title":"Surgical Data Science: A Consensus Perspective","abstract":" Surgical data science is a scientific discipline with the objective of\nimproving the quality of interventional healthcare and its value through\ncapturing, organization, analysis, and modeling of data. The goal of the 1st\nworkshop on Surgical Data Science was to bring together researchers working on\ndiverse topics in surgical data science in order to discuss existing\nchallenges, potential standards and new research directions in the field.\nInspired by current open space and think tank formats, it was organized in June\n2016 in Heidelberg. While the first day of the workshop, which was dominated by\ninteractive sessions, was open to the public, the second day was reserved for a\nboard meeting on which the information gathered on the public day was processed\nby (1) discussing remaining open issues, (2) deriving a joint definition for\nsurgical data science and (3) proposing potential strategies for advancing the\nfield. 
This document summarizes the key findings.\n","authors":"Lena Maier-Hein|Matthias Eisenmann|Carolin Feldmann|Hubertus Feussner|Germain Forestier|Stamatia Giannarou|Bernard Gibaud|Gregory D. Hager|Makoto Hashizume|Darko Katic|Hannes Kenngott|Ron Kikinis|Michael Kranzfelder|Anand Malpani|Keno März|Beat Müuller-Stich|Nassir Navab|Thomas Neumuth|Nicolas Padoy|Adrian Park|Carla Pugh|Nicolai Schoch|Danail Stoyanov|Russell Taylor|Martin Wagner|S. Swaroop Vedula|Pierre Jannin*|Stefanie Speidel*","affiliations":"*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors|*: shared senior authors","link_abstract":"http://arxiv.org/abs/1806.03184v1","link_pdf":"http://arxiv.org/pdf/1806.03184v1","link_doi":"","comment":"29 pages","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1806.03226v2","submitted":"2018-06-08 15:46:28","updated":"2018-11-11 21:16:30","title":"Reduction of multivariate mixtures and its applications","abstract":" We consider fast deterministic algorithms to identify the \"best\" linearly\nindependent terms in multivariate mixtures and use them to compute, up to a\nuser-selected accuracy, an equivalent representation with fewer terms. One\nalgorithm employs a pivoted Cholesky decomposition of the Gram matrix\nconstructed from the terms of the mixture to select what we call skeleton terms\nand the other uses orthogonalization for the same purpose. Importantly, the\nmultivariate mixtures do not have to be a separated representation of a\nfunction. Both algorithms require $O(r^2 N + p(d) r N) $ operations, where $N$\nis the initial number of terms in the multivariate mixture, $r$ is the number\nof selected linearly independent terms, and $p(d)$ is the cost of computing the\ninner product between two terms of a mixture in $d$ variables. For general\nGaussian mixtures $p(d) \\sim d^3$ since we need to diagonalize a $d\\times d$\nmatrix, whereas for separated representations $p(d) \\sim d$. Due to\nconditioning issues, the resulting accuracy is limited to about one half of the\navailable significant digits for both algorithms. We also describe an\nalternative algorithm that is capable of achieving higher accuracy but is only\napplicable in low dimensions or to multivariate mixtures in separated form. We\ndescribe a number of initial applications of these algorithms to solve partial\ndifferential and integral equations and to address several problems in data\nscience. 
For data science applications in high dimensions,we consider the\nkernel density estimation (KDE) approach for constructing a probability density\nfunction (PDF) of a cloud of points, a far-field kernel summation method and\nthe construction of equivalent sources for non-oscillatory kernels (used in\nboth, computational physics and data science) and, finally, show how to use the\nnew algorithm to produce seeds for subdividing a cloud of points into groups.\n","authors":"Gregory Beylkin|Lucas Monzon|Xinshuo Yang","affiliations":"","link_abstract":"http://arxiv.org/abs/1806.03226v2","link_pdf":"http://arxiv.org/pdf/1806.03226v2","link_doi":"http://dx.doi.org/10.1016/j.jcp.2019.01.015","comment":"","journal_ref":"","doi":"10.1016/j.jcp.2019.01.015","primary_category":"math.NA","categories":"math.NA|65R20, 41A63, 41A45"} {"id":"1806.05886v1","submitted":"2018-06-15 10:15:10","updated":"2018-06-15 10:15:10","title":"Automated Image Data Preprocessing with Deep Reinforcement Learning","abstract":" Data preparation, i.e. the process of transforming raw data into a format\nthat can be used for training effective machine learning models, is a tedious\nand time-consuming task. For image data, preprocessing typically involves a\nsequence of basic transformations such as cropping, filtering, rotating or\nflipping images. Currently, data scientists decide manually based on their\nexperience which transformations to apply in which particular order to a given\nimage data set. Besides constituting a bottleneck in real-world data science\nprojects, manual image data preprocessing may yield suboptimal results as data\nscientists need to rely on intuition or trial-and-error approaches when\nexploring the space of possible image transformations and thus might not be\nable to discover the most effective ones. To mitigate the inefficiency and\npotential ineffectiveness of manual data preprocessing, this paper proposes a\ndeep reinforcement learning framework to automatically discover the optimal\ndata preprocessing steps for training an image classifier. The framework takes\nas input sets of labeled images and predefined preprocessing transformations.\nIt jointly learns the classifier and the optimal preprocessing transformations\nfor individual images. Experimental results show that the proposed approach not\nonly improves the accuracy of image classifiers, but also makes them\nsubstantially more robust to noisy inputs at test time.\n","authors":"Tran Ngoc Minh|Mathieu Sinn|Hoang Thanh Lam|Martin Wistuba","affiliations":"","link_abstract":"http://arxiv.org/abs/1806.05886v1","link_pdf":"http://arxiv.org/pdf/1806.05886v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1806.08426v1","submitted":"2018-06-21 21:04:38","updated":"2018-06-21 21:04:38","title":"A validation of the use of data sciences for the study of slope\n stability in open pit mines","abstract":" In this work, we present an exploratory study of stability of an open pit\nmine in the north of Chile with the use of data mining. It is important to note\nthat the study of slope stability is a subject of great interest to mining\ncompanies, this is due to the importance in the safety of workers and the\nprotection of infrastructures, whether private or public, in those places\nsusceptible to this kind of phenomena, as well as, for road slopes and close to\ncommunities or infrastructures, among others. 
It is also important to highlight\nthat these phenomena can compromise important economic resources and can even\ncause human losses. In our case, this study seeks to increase the knowledge of\nthese phenomena and thus, try to predict their occurrence, by means of risk\nindicators, potentially allowing the mining company supervision to consider\npredictive measures. It should be considered that there is no online test that\nensures timely prediction. In previous studies conducted in other mines, it has\nbeen corroborated that the phenomena and factors associated with the movement\nof slopes and landslides are extremely complex and highly nonlinear, which is\nwhy the methods associated with the called \"data mining\", were found to be\nideal for discovering new information in the data, which is recorded\nperiodically in the continuous monitoring that mining companies have of their\ndeposits, which allowed to find important correlations for the search of\npredictors of these phenomena. Some results, with data coming from different\nsources, are presented at the end of this work. We note that according to the\ninformation provided by the mining company, the results were favorable in the\nindicators of up to six months, giving as risk areas the correct sectors and\npredicting the April 2017 event.\n","authors":"J. H. Ortega|M. Rapiman|L. Rojo|J. P. Rivacoba","affiliations":"","link_abstract":"http://arxiv.org/abs/1806.08426v1","link_pdf":"http://arxiv.org/pdf/1806.08426v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"physics.geo-ph","categories":"physics.geo-ph"} {"id":"1806.09861v1","submitted":"2018-06-26 09:18:46","updated":"2018-06-26 09:18:46","title":"The importance of ensemble techniques for operational space weather\n forecasting","abstract":" The space weather community has begun to use frontier methods such as data\nassimilation, machine learning, and ensemble modeling to advance current\noperational forecasting efforts. This was highlighted by a multi-disciplinary\nsession at the 2017 American Geophysical Union Meeting, 'Frontier\nSolar-Terrestrial Science Enabled by the Combination of Data-Driven Techniques\nand Physics-Based Understanding', with considerable discussion surrounding\nensemble techniques. Here ensemble methods are described in detail; using a set\nof predictions to improve on a single-model output, for example taking a simple\naverage of multiple models, or using more complex techniques for data\nassimilation. They have been used extensively in fields such as numerical\nweather prediction and data science, for both improving model accuracy and\nproviding a measure of model uncertainty. Researchers in the space weather\ncommunity have found them to be similarly useful, and some examples of success\nstories are highlighted in this commentary. Future developments are also\nencouraged to transition these basic research efforts to operational\nforecasting as well as providing prediction errors to aid end-user\nunderstanding.\n","authors":"Sophie A. Murray","affiliations":"","link_abstract":"http://arxiv.org/abs/1806.09861v1","link_pdf":"http://arxiv.org/pdf/1806.09861v1","link_doi":"http://dx.doi.org/10.1029/2018SW001861","comment":"Accepted for publication as invited Commentary in Space Weather. 
10\n pages, 3 figures","journal_ref":"Space Weather, 16, 777-783 (2018)","doi":"10.1029/2018SW001861","primary_category":"physics.space-ph","categories":"physics.space-ph|astro-ph.EP|physics.ao-ph"} {"id":"1806.09888v1","submitted":"2018-06-26 10:21:06","updated":"2018-06-26 10:21:06","title":"Towards an understanding of CNNs: analysing the recovery of activation\n pathways via Deep Convolutional Sparse Coding","abstract":" Deep Convolutional Sparse Coding (D-CSC) is a framework reminiscent of deep\nconvolutional neural networks (DCNNs), but by omitting the learning of the\ndictionaries one can more transparently analyse the role of the activation\nfunction and its ability to recover activation paths through the layers.\nPapyan, Romano, and Elad conducted an analysis of such an architecture,\ndemonstrated the relationship with DCNNs and proved conditions under which the\nD-CSC is guaranteed to recover specific activation paths. A technical\ninnovation of their work highlights that one can view the efficacy of the ReLU\nnonlinear activation function of a DCNN through a new variant of the tensor's\nsparsity, referred to as stripe-sparsity. Using this they proved that\nrepresentations with an activation density proportional to the ambient\ndimension of the data are recoverable. We extend their uniform guarantees to a\nmodified model and prove that with high probability the true activation is\ntypically possible to recover for a greater density of activations per layer.\nOur extension follows from incorporating the prior work on one step\nthresholding by Schnass and Vandergheynst.\n","authors":"Michael Murray|Jared Tanner","affiliations":"","link_abstract":"http://arxiv.org/abs/1806.09888v1","link_pdf":"http://arxiv.org/pdf/1806.09888v1","link_doi":"","comment":"Long version (12 pages excluding references) of paper accepted at the\n IEEE 2018 Data Science Workshop","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1806.10695v3","submitted":"2018-06-27 20:50:32","updated":"2020-04-19 16:08:41","title":"Interpolating splines on graphs for data science applications","abstract":" We introduce intrinsic interpolatory bases for data structured on graphs and\nderive properties of those bases. Polyharmonic Lagrange functions are shown to\nsatisfy exponential decay away from their centers. The decay depends on the\ndensity of the zeros of the Lagrange function, showing that they scale with the\ndensity of the data. These results indicate that Lagrange-type bases are ideal\nbuilding blocks for analyzing data on graphs, and we illustrate their use in\nkernel-based machine learning applications.\n","authors":"John Paul Ward|Francis J. Narcowich|Joseph D. Ward","affiliations":"","link_abstract":"http://arxiv.org/abs/1806.10695v3","link_pdf":"http://arxiv.org/pdf/1806.10695v3","link_doi":"","comment":"17 pages","journal_ref":"","doi":"","primary_category":"math.NA","categories":"math.NA|cs.NA|math.CA|41A05, 41A15, 41A65"} {"id":"1807.00347v1","submitted":"2018-07-01 15:55:25","updated":"2018-07-01 15:55:25","title":"Robust Inference Under Heteroskedasticity via the Hadamard Estimator","abstract":" Drawing statistical inferences from large datasets in a model-robust way is\nan important problem in statistics and data science. In this paper, we propose\nmethods that are robust to large and unequal noise in different observational\nunits (i.e., heteroskedasticity) for statistical inference in linear\nregression. 
We leverage the Hadamard estimator, which is unbiased for the\nvariances of ordinary least-squares regression. This is in contrast to the\npopular White's sandwich estimator, which can be substantially biased in high\ndimensions. We propose to estimate the signal strength, noise level,\nsignal-to-noise ratio, and mean squared error via the Hadamard estimator. We\ndevelop a new degrees of freedom adjustment that gives more accurate confidence\nintervals than variants of White's sandwich estimator. Moreover, we provide\nconditions ensuring the estimator is well-defined, by studying a new random\nmatrix ensemble in which the entries of a random orthogonal projection matrix\nare squared. We also show approximate normality, using the second-order\nPoincare inequality. Our work provides improved statistical theory and methods\nfor linear regression in high dimensions.\n","authors":"Edgar Dobriban|Weijie J. Su","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.00347v1","link_pdf":"http://arxiv.org/pdf/1807.00347v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.ME|stat.TH"} {"id":"1807.00349v1","submitted":"2018-07-01 16:08:41","updated":"2018-07-01 16:08:41","title":"Heuristic Framework for Multi-Scale Testing of the Multi-Manifold\n Hypothesis","abstract":" When analyzing empirical data, we often find that global linear models\noverestimate the number of parameters required. In such cases, we may ask\nwhether the data lies on or near a manifold or a set of manifolds (a so-called\nmulti-manifold) of lower dimension than the ambient space. This question can be\nphrased as a (multi-) manifold hypothesis. The identification of such intrinsic\nmultiscale features is a cornerstone of data analysis and representation and\nhas given rise to a large body of work on manifold learning. In this work, we\nreview key results on multi-scale data analysis and intrinsic dimension\nfollowed by the introduction of a heuristic, multiscale framework for testing\nthe multi-manifold hypothesis. Our method implements a hypothesis test on a set\nof spline-interpolated manifolds constructed from variance-based intrinsic\ndimensions. The workflow is suitable for empirical data analysis as we\ndemonstrate on two use cases.\n","authors":"F. Patricia Medina|Linda Ness|Melanie Weber|Karamatou Yacoubou Djima","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.00349v1","link_pdf":"http://arxiv.org/pdf/1807.00349v1","link_doi":"","comment":"Workshop paper (Women in Data Science and Mathematics Research\n Collaboration Workshop (WiSDM); ICERM, July 2017)","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG|57-04"} {"id":"1807.01401v1","submitted":"2018-07-03 23:35:47","updated":"2018-07-03 23:35:47","title":"Endmember Extraction on the Grassmannian","abstract":" Endmember extraction plays a prominent role in a variety of data analysis\nproblems as endmembers often correspond to data representing the purest or best\nrepresentative of some feature. Identifying endmembers then can be useful for\nfurther identification and classification tasks. In settings with\nhigh-dimensional data, such as hyperspectral imagery, it can be useful to\nconsider endmembers that are subspaces as they are capable of capturing a wider\nrange of variations of a signature. The endmember extraction problem in this\nsetting thus translates to finding the vertices of the convex hull of a set of\npoints on a Grassmannian. 
In the presence of noise, it can be less clear\nwhether a point should be considered a vertex. In this paper, we propose an\nalgorithm to extract endmembers on a Grassmannian, identify subspaces of\ninterest that lie near the boundary of a convex hull, and demonstrate the use\nof the algorithm on a synthetic example and on the 220 spectral band AVIRIS\nIndian Pines hyperspectral image.\n","authors":"Elin Farnell|Henry Kvinge|Michael Kirby|Chris Peterson","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.01401v1","link_pdf":"http://arxiv.org/pdf/1807.01401v1","link_doi":"","comment":"To appear in Proceedings of the 2018 IEEE Data Science Workshop,\n Lausanne, Switzerland","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|cs.LG|eess.IV|eess.SP"} {"id":"1807.02013v1","submitted":"2018-07-05 14:11:41","updated":"2018-07-05 14:11:41","title":"Dynamic network identification from non-stationary vector autoregressive\n time series","abstract":" Learning the dynamics of complex systems features a large number of\napplications in data science. Graph-based modeling and inference underpins the\nmost prominent family of approaches to learn complex dynamics due to their\nability to capture the intrinsic sparsity of direct interactions in such\nsystems. They also provide the user with interpretable graphs that unveil\nbehavioral patterns and changes. To cope with the time-varying nature of\ninteractions, this paper develops an estimation criterion and a solver to learn\nthe parameters of a time-varying vector autoregressive model supported on a\nnetwork of time series. The notion of local breakpoint is proposed to\naccommodate changes at individual edges. It contrasts with existing works,\nwhich assume that changes at all nodes are aligned in time. Numerical\nexperiments validate the proposed schemes.\n","authors":"Luis M. Lopez-Ramos|Daniel Romero|Bakht Zaman|Baltasar Beferull-Lozano","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.02013v1","link_pdf":"http://arxiv.org/pdf/1807.02013v1","link_doi":"","comment":"5 pages, 2 figures, conference paper submitted to GlobalSIP2018","journal_ref":"","doi":"","primary_category":"eess.SP","categories":"eess.SP"} {"id":"1807.03750v1","submitted":"2018-07-05 21:32:18","updated":"2018-07-05 21:32:18","title":"Navigating Diverse Data Science Learning: Critical Reflections Towards\n Future Practice","abstract":" Data Science is currently a popular field of science attracting expertise\nfrom very diverse backgrounds. Current learning practices need to acknowledge\nthis and adapt to it. 
This paper summarises some experiences relating to such\nlearning approaches from teaching a postgraduate Data Science module, and draws\nsome learned lessons that are of relevance to others teaching Data Science.\n","authors":"Yehia Elkhatib","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.03750v1","link_pdf":"http://arxiv.org/pdf/1807.03750v1","link_doi":"http://dx.doi.org/10.1109/CloudCom.2017.58","comment":"","journal_ref":"4th Workshop on Curricula and Teaching Methods in Cloud Computing,\n Big Data, and Data Science, 2017","doi":"10.1109/CloudCom.2017.58","primary_category":"cs.GL","categories":"cs.GL|cs.LG|stat.ML"} {"id":"1807.02221v2","submitted":"2018-07-06 01:59:42","updated":"2018-07-10 22:39:44","title":"The Data Science of Hollywood: Using Emotional Arcs of Movies to Drive\n Business Model Innovation in Entertainment Industries","abstract":" Much of business literature addresses the issues of consumer-centric design:\nhow can businesses design customized services and products which accurately\nreflect consumer preferences? This paper uses data science natural language\nprocessing methodology to explore whether and to what extent emotions shape\nconsumer preferences for media and entertainment content. Using a unique\nfiltered dataset of 6,174 movie scripts, we generate a mapping of screen\ncontent to capture the emotional trajectory of each motion picture. We then\ncombine the obtained mappings into clusters which represent groupings of\nconsumer emotional journeys. These clusters are used to predict overall success\nparameters of the movies including box office revenues, viewer satisfaction\nlevels (captured by IMDb ratings), awards, as well as the number of viewers'\nand critics' reviews. We find that like books all movie stories are dominated\nby 6 basic shapes. The highest box offices are associated with the Man in a\nHole shape which is characterized by an emotional fall followed by an emotional\nrise. This shape results in financially successful movies irrespective of genre\nand production budget. Yet, Man in a Hole succeeds not because it produces most\n\"liked\" movies but because it generates most \"talked about\" movies.\nInterestingly, a carefully chosen combination of production budget and genre\nmay produce a financially successful movie with any emotional shape.\nImplications of this analysis for generating on-demand content and for driving\nbusiness model innovation in entertainment industries are discussed.\n","authors":"Marco Del Vecchio|Alexander Kharlamov|Glenn Parry|Ganna Pogrebna","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.02221v2","link_pdf":"http://arxiv.org/pdf/1807.02221v2","link_doi":"http://dx.doi.org/10.1080/01605682.2019.1705194","comment":"","journal_ref":"Journal of the Operational Research Society, 2020","doi":"10.1080/01605682.2019.1705194","primary_category":"cs.CL","categories":"cs.CL|cs.CY"} {"id":"1807.02222v1","submitted":"2018-07-06 02:20:16","updated":"2018-07-06 02:20:16","title":"Digital Geometry, a Survey","abstract":" This paper provides an overview of modern digital geometry and topology\nthrough mathematical principles, algorithms, and measurements. It also covers\nrecent developments in the applications of digital geometry and topology\nincluding image processing, computer vision, and data science. 
Recent research\nstrongly showed that digital geometry has made considerable contributions to\nmodelings and algorithms in image segmentation, algorithmic analysis, and\nBigData analytics.\n","authors":"Li Chen|David Coeurjolly","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.02222v1","link_pdf":"http://arxiv.org/pdf/1807.02222v1","link_doi":"","comment":"9 pages ; 9 figures","journal_ref":"","doi":"","primary_category":"cs.DM","categories":"cs.DM|cs.GR"} {"id":"1807.02876v3","submitted":"2018-07-08 20:20:19","updated":"2019-05-16 05:13:31","title":"Machine Learning in High Energy Physics Community White Paper","abstract":" Machine learning has been applied to several problems in particle physics\nresearch, beginning with applications to high-level physics analysis in the\n1990s and 2000s, followed by an explosion of applications in particle and event\nidentification and reconstruction in the 2010s. In this document we discuss\npromising future research and development areas for machine learning in\nparticle physics. We detail a roadmap for their implementation, software and\nhardware resource requirements, collaborative initiatives with the data science\ncommunity, academia and industry, and training the particle physics community\nin data science. The main objective of the document is to connect and motivate\nthese areas of research and development with the physics drivers of the\nHigh-Luminosity Large Hadron Collider and future neutrino experiments and\nidentify the resource needs for their implementation. Additionally we identify\nareas where collaboration with external communities will be of great benefit.\n","authors":"Kim Albertsson|Piero Altoe|Dustin Anderson|John Anderson|Michael Andrews|Juan Pedro Araque Espinosa|Adam Aurisano|Laurent Basara|Adrian Bevan|Wahid Bhimji|Daniele Bonacorsi|Bjorn Burkle|Paolo Calafiura|Mario Campanelli|Louis Capps|Federico Carminati|Stefano Carrazza|Yi-fan Chen|Taylor Childers|Yann Coadou|Elias Coniavitis|Kyle Cranmer|Claire David|Douglas Davis|Andrea De Simone|Javier Duarte|Martin Erdmann|Jonas Eschle|Amir Farbin|Matthew Feickert|Nuno Filipe Castro|Conor Fitzpatrick|Michele Floris|Alessandra Forti|Jordi Garra-Tico|Jochen Gemmler|Maria Girone|Paul Glaysher|Sergei Gleyzer|Vladimir Gligorov|Tobias Golling|Jonas Graw|Lindsey Gray|Dick Greenwood|Thomas Hacker|John Harvey|Benedikt Hegner|Lukas Heinrich|Ulrich Heintz|Ben Hooberman|Johannes Junggeburth|Michael Kagan|Meghan Kane|Konstantin Kanishchev|Przemysław Karpiński|Zahari Kassabov|Gaurav Kaul|Dorian Kcira|Thomas Keck|Alexei Klimentov|Jim Kowalkowski|Luke Kreczko|Alexander Kurepin|Rob Kutschke|Valentin Kuznetsov|Nicolas Köhler|Igor Lakomov|Kevin Lannon|Mario Lassnig|Antonio Limosani|Gilles Louppe|Aashrita Mangu|Pere Mato|Narain Meenakshi|Helge Meinhard|Dario Menasce|Lorenzo Moneta|Seth Moortgat|Mark Neubauer|Harvey Newman|Sydney Otten|Hans Pabst|Michela Paganini|Manfred Paulini|Gabriel Perdue|Uzziel Perez|Attilio Picazio|Jim Pivarski|Harrison Prosper|Fernanda Psihas|Alexander Radovic|Ryan Reece|Aurelius Rinkevicius|Eduardo Rodrigues|Jamal Rorie|David Rousseau|Aaron Sauers|Steven Schramm|Ariel Schwartzman|Horst Severini|Paul Seyfert|Filip Siroky|Konstantin Skazytkin|Mike Sokoloff|Graeme Stewart|Bob Stienen|Ian Stockdale|Giles Strong|Wei Sun|Savannah Thais|Karen Tomko|Eli Upfal|Emanuele Usai|Andrey Ustyuzhanin|Martin Vala|Justin Vasel|Sofia Vallecorsa|Mauro Verzetti|Xavier Vilasís-Cardona|Jean-Roch Vlimant|Ilija Vukotic|Sean-Jiun Wang|Gordon Watts|Michael Williams|Wenjing Wu|Stefan Wunsch|Kun 
Yang|Omar Zapata","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.02876v3","link_pdf":"http://arxiv.org/pdf/1807.02876v3","link_doi":"","comment":"Editors: Sergei Gleyzer, Paul Seyfert and Steven Schramm","journal_ref":"","doi":"","primary_category":"physics.comp-ph","categories":"physics.comp-ph|cs.LG|hep-ex|stat.ML"} {"id":"1807.02989v2","submitted":"2018-07-09 08:48:15","updated":"2018-11-02 09:37:35","title":"Spatio-temporal variations in the urban rhythm: the travelling waves of\n crime","abstract":" In the last decades, the notion that cities are in a state of equilibrium\nwith a centralised organisation has given place to the viewpoint of cities in\ndisequilibrium and organised from bottom to up. In this perspective, cities are\nevolving systems that exhibit emergent phenomena built from local decisions.\nWhile urban evolution promotes the emergence of positive social phenomena such\nas the formation of innovation hubs and the increase in cultural diversity, it\nalso yields negative phenomena such as increases in criminal activity. Yet, we\nare still far from understanding the driving mechanisms of these phenomena. In\nparticular, approaches to analyse urban phenomena are limited in scope by\nneglecting both temporal non-stationarity and spatial heterogeneity. In the\ncase of criminal activity, we know for more than one century that crime peaks\nduring specific times of the year, but the literature still fails to\ncharacterise the mobility of crime. Here we develop an approach to describe the\nspatial, temporal, and periodic variations in urban quantities. With crime data\nfrom 12 cities, we characterise how the periodicity of crime varies spatially\nacross the city over time. We confirm one-year criminal cycles and show that\nthis periodicity occurs unevenly across the city. These `waves of crime' keep\ntravelling across the city: while cities have a stable number of regions with a\ncircannual period, the regions exhibit non-stationary series. Our findings\nsupport the concept of cities in a constant change, influencing urban\nphenomena---in agreement with the notion of cities not in equilibrium.\n","authors":"Marcos Oliveira|Eraldo Ribeiro|Carmelo Bastos-Filho|Ronaldo Menezes","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.02989v2","link_pdf":"http://arxiv.org/pdf/1807.02989v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-018-0158-4","comment":"11 pages, 4 figures","journal_ref":"EPJ Data Science 2018 7:29","doi":"10.1140/epjds/s13688-018-0158-4","primary_category":"cs.CY","categories":"cs.CY|cs.SI|physics.soc-ph"} {"id":"1807.03662v1","submitted":"2018-07-10 14:10:54","updated":"2018-07-10 14:10:54","title":"TrialChain: A Blockchain-Based Platform to Validate Data Integrity in\n Large, Biomedical Research Studies","abstract":" The governance of data used for biomedical research and clinical trials is an\nimportant requirement for generating accurate results. To improve the\nvisibility of data quality and analysis, we developed TrialChain, a\nblockchain-based platform that can be used to validate data integrity from\nlarge, biomedical research studies. We implemented a private blockchain using\nthe MultiChain platform and integrated it with a data science platform deployed\nwithin a large research center. An administrative web application was built\nwith Python to manage the platform, which was built with a microservice\narchitecture using Docker. 
The TrialChain platform was integrated during data\nacquisition into our existing data science platform. Using NiFi, data were\nhashed and logged within the local blockchain infrastructure. To provide public\nvalidation, the local blockchain state was periodically synchronized to the\npublic Ethereum network. The use of a combined private/public blockchain\nplatform allows for both public validation of results while maintaining\nadditional security and lower cost for blockchain transactions. Original data\nand modifications due to downstream analysis can be logged within TrialChain\nand data assets or results can be rapidly validated when needed using API calls\nto the platform. The TrialChain platform provides a data governance solution to\naudit the acquisition and analysis of biomedical research data. The platform\nprovides cryptographic assurance of data authenticity and can also be used to\ndocument data analysis.\n","authors":"Hao Dai|H Patrick Young|Thomas JS Durant|Guannan Gong|Mingming Kang|Harlan M Krumholz|Wade L Schulz|Lixin Jiang","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.03662v1","link_pdf":"http://arxiv.org/pdf/1807.03662v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC|cs.CR"} {"id":"1807.06002v1","submitted":"2018-07-14 12:42:19","updated":"2018-07-14 12:42:19","title":"Evaluation as a Service architecture and crowdsourced problems solving\n implemented in Optil.io platform","abstract":" Reliable and trustworthy evaluation of algorithms is a challenging process.\nFirstly, each algorithm has its strengths and weaknesses, and the selection of\ntest instances can significantly influence the assessment process. Secondly,\nthe measured performance of the algorithm highly depends on the test\nenvironment architecture, i.e., CPU model, available memory, cache\nconfiguration, operating system's kernel, and even compilation flags. Finally,\nit is often difficult to compare algorithm with software prepared by other\nresearchers. Evaluation as a Service (EaaS) is a cloud computing architecture\nthat tries to make assessment process more reliable by providing online tools\nand test instances dedicated to the evaluation of algorithms. One of such\nplatforms is Optil.io which gives the possibility to define problems, store\nevaluation data and evaluate solutions submitted by researchers in almost real\ntime. In this paper, we briefly present this platform together with four\nchallenges that were organized with its support.\n","authors":"Szymon Wasik|Maciej Antczak|Jan Badura|Artur Laskowski","affiliations":"Institute of Computing Science, Poznan University of Technology|Institute of Computing Science, Poznan University of Technology|Institute of Computing Science, Poznan University of Technology|Institute of Computing Science, Poznan University of Technology","link_abstract":"http://arxiv.org/abs/1807.06002v1","link_pdf":"http://arxiv.org/pdf/1807.06002v1","link_doi":"","comment":"Data Science meets Optimization Workshop, Federated Artificial\n Intelligence Meeting 2018","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.PF"} {"id":"1807.05691v2","submitted":"2018-07-16 06:21:54","updated":"2019-01-25 06:19:21","title":"Teaching machines to understand data science code by semantic enrichment\n of dataflow graphs","abstract":" Your computer is continuously executing programs, but does it really\nunderstand them? Not in any meaningful sense. 
That burden falls upon human\nknowledge workers, who are increasingly asked to write and understand code.\nThey deserve to have intelligent tools that reveal the connections between code\nand its subject matter. Towards this prospect, we develop an AI system that\nforms semantic representations of computer programs, using techniques from\nknowledge representation and program analysis. To create the representations,\nwe introduce an algorithm for enriching dataflow graphs with semantic\ninformation. The semantic enrichment algorithm is undergirded by a new ontology\nlanguage for modeling computer programs and a new ontology about data science,\nwritten in this language. Throughout the paper, we focus on code written by\ndata scientists and we locate our work within a larger movement towards\ncollaborative, open, and reproducible science.\n","authors":"Evan Patterson|Ioana Baldini|Aleksandra Mojsilovic|Kush R. Varshney","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.05691v2","link_pdf":"http://arxiv.org/pdf/1807.05691v2","link_doi":"","comment":"33 pages. Significantly expanded from previous version","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI|cs.SE"} {"id":"1807.07346v1","submitted":"2018-07-19 11:29:40","updated":"2018-07-19 11:29:40","title":"Indexing Execution Patterns in Workflow Provenance Graphs through\n Generalized Trie Structures","abstract":" Over the last years, scientific workflows have become mature enough to be\nused in a production style. However, despite the increasing maturity, there is\nstill a shortage of tools for searching, adapting, and reusing workflows that\nhinders a more generalized adoption by the scientific communities. Indeed, due\nto the limited availability of machine-readable scientific metadata and the\nheterogeneity of workflow specification formats and representations, new ways\nto leverage alternative sources of information that complement existing\napproaches are needed. In this paper we address such limitations by applying\nstatistically enriched generalized trie structures to exploit workflow\nexecution provenance information in order to assist the analysis, indexing and\nsearch of scientific workflows. Our method bridges the gap between the\ndescription of what a workflow is supposed to do according to its specification\nand related metadata and what it actually does as recorded in its provenance\nexecution trace. In doing so, we also prove that the proposed method\noutperforms SPARQL 1.1 Property Paths for querying provenance graphs.\n","authors":"Esteban García-Cuesta|José M. Gómez-Pérez","affiliations":"Data Science Laboratory, School of Arquitecture, Engineering and Design, Universidad Europea de Madrid, Spain|Expert System, Spain","link_abstract":"http://arxiv.org/abs/1807.07346v1","link_pdf":"http://arxiv.org/pdf/1807.07346v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB|cs.IR|cs.LG"} {"id":"1807.07351v1","submitted":"2018-07-19 11:44:10","updated":"2018-07-19 11:44:10","title":"Can We Assess Mental Health through Social Media and Smart Devices?\n Addressing Bias in Methodology and Evaluation","abstract":" Predicting mental health from smartphone and social media data on a\nlongitudinal basis has recently attracted great interest, with very promising\nresults being reported across many studies. Such approaches have the potential\nto revolutionise mental health assessment, if their development and evaluation\nfollows a real world deployment setting. 
In this work we take a closer look at\nstate-of-the-art approaches, using different mental health datasets and\nindicators, different feature sources and multiple simulations, in order to\nassess their ability to generalise. We demonstrate that under a pragmatic\nevaluation framework, none of the approaches deliver or even approach the\nreported performances. In fact, we show that current state-of-the-art\napproaches can barely outperform the most na\\\"ive baselines in the real-world\nsetting, posing serious questions not only about their deployment ability, but\nalso about the contribution of the derived features for the mental health\nassessment task and how to make better use of such data in the future.\n","authors":"Adam Tsakalidis|Maria Liakata|Theo Damoulas|Alexandra I. Cristea","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.07351v1","link_pdf":"http://arxiv.org/pdf/1807.07351v1","link_doi":"","comment":"Preprint accepted for publication in the European Conference on\n Machine Learning and Principles and Practice of Knowledge Discovery in\n Databases (ECML-PKDD 2018 Applied Data Science Track)","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.CL"} {"id":"1807.08383v1","submitted":"2018-07-22 23:29:28","updated":"2018-07-22 23:29:28","title":"PaloBoost: An Overfitting-robust TreeBoost with Out-of-Bag Sample\n Regularization Techniques","abstract":" Stochastic Gradient TreeBoost is often found in many winning solutions in\npublic data science challenges. Unfortunately, the best performance requires\nextensive parameter tuning and can be prone to overfitting. We propose\nPaloBoost, a Stochastic Gradient TreeBoost model that uses novel regularization\ntechniques to guard against overfitting and is robust to parameter settings.\nPaloBoost uses the under-utilized out-of-bag samples to perform gradient-aware\npruning and estimate adaptive learning rates. Unlike other Stochastic Gradient\nTreeBoost models that use the out-of-bag samples to estimate test errors,\nPaloBoost treats the samples as a second batch of training samples to prune the\ntrees and adjust the learning rates. As a result, PaloBoost can dynamically\nadjust tree depths and learning rates to achieve faster learning at the start\nand slower learning as the algorithm converges. We illustrate how these\nregularization techniques can be efficiently implemented and propose a new\nformula for calculating feature importance to reflect the node coverages and\nlearning rates. Extensive experimental results on seven datasets demonstrate\nthat PaloBoost is robust to overfitting, is less sensitivity to the parameters,\nand can also effectively identify meaningful features.\n","authors":"Yubin Park|Joyce C. Ho","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.08383v1","link_pdf":"http://arxiv.org/pdf/1807.08383v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG|stat.ME"} {"id":"1807.09127v2","submitted":"2018-07-23 06:03:24","updated":"2018-08-13 06:55:28","title":"Talent Flow Analytics in Online Professional Network","abstract":" Analyzing job hopping behavior is important for understanding job preference\nand career progression of working individuals. When analyzed at the workforce\npopulation level, job hop analysis helps to gain insights of talent flow among\ndifferent jobs and organizations. Traditionally, surveys are conducted on job\nseekers and employers to study job hop behavior. 
Beyond surveys, job hop\nbehavior can also be studied in a highly scalable and timely manner using a\ndata driven approach in response to fast-changing job landscape. Fortunately,\nthe advent of online professional networks (OPNs) has made it possible to\nperform a large-scale analysis of talent flow. In this paper, we present a new\ndata analytics framework to analyze the talent flow patterns of close to 1\nmillion working professionals from three different countries/regions using\ntheir publicly-accessible profiles in an established OPN. As OPN data are\noriginally generated for professional networking applications, our proposed\nframework re-purposes the same data for a different analytics task. Prior to\nperforming job hop analysis, we devise a job title normalization procedure to\nmitigate the amount of noise in the OPN data. We then devise several metrics to\nmeasure the amount of work experience required to take up a job, to determine\nthat existence duration of the job (also known as the job age), and the\ncorrelation between the above metric and propensity of hopping. We also study\nhow job hop behavior is related to job promotion/demotion. Lastly, we perform\nconnectivity analysis at job and organization levels to derive insights on\ntalent flow as well as job and organizational competitiveness.\n","authors":"Richard J. Oentaryo|Ee-Peng Lim|Xavier Jayaraj Siddarth Ashok|Philips Kokoh Prasetyo|Koon Han Ong|Zi Quan Lau","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.09127v2","link_pdf":"http://arxiv.org/pdf/1807.09127v2","link_doi":"http://dx.doi.org/10.1007/s41019-018-0070-8","comment":"arXiv admin note: extension of arXiv:1711.05887, Data Science and\n Engineering, 2018","journal_ref":"","doi":"10.1007/s41019-018-0070-8","primary_category":"cs.SI","categories":"cs.SI"} {"id":"1807.08712v1","submitted":"2018-07-23 16:40:37","updated":"2018-07-23 16:40:37","title":"Data Science with Vadalog: Bridging Machine Learning and Reasoning","abstract":" Following the recent successful examples of large technology companies, many\nmodern enterprises seek to build knowledge graphs to provide a unified view of\ncorporate knowledge and to draw deep insights using machine learning and\nlogical reasoning. There is currently a perceived disconnect between the\ntraditional approaches for data science, typically based on machine learning\nand statistical modelling, and systems for reasoning with domain knowledge. In\nthis paper we present a state-of-the-art Knowledge Graph Management System,\nVadalog, which delivers highly expressive and efficient logical reasoning and\nprovides seamless integration with modern data science toolkits, such as the\nJupyter platform. We demonstrate how to use Vadalog to perform traditional data\nwrangling tasks, as well as complex logical and probabilistic reasoning. We\nargue that this is a significant step forward towards combining machine\nlearning and reasoning in data science.\n","authors":"Luigi Bellomarini|Ruslan R. 
Fayzrakhmanov|Georg Gottlob|Andrey Kravchenko|Eleonora Laurenza|Yavor Nenov|Stephane Reissfelder|Emanuel Sallinger|Evgeny Sherkhonov|Lianlong Wu","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.08712v1","link_pdf":"http://arxiv.org/pdf/1807.08712v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB|cs.AI"} {"id":"1807.09011v1","submitted":"2018-07-24 10:15:49","updated":"2018-07-24 10:15:49","title":"Uncertainty Modelling in Deep Networks: Forecasting Short and Noisy\n Series","abstract":" Deep Learning is a consolidated, state-of-the-art Machine Learning tool to\nfit a function when provided with large data sets of examples. However, in\nregression tasks, the straightforward application of Deep Learning models\nprovides a point estimate of the target. In addition, the model does not take\ninto account the uncertainty of a prediction. This represents a great\nlimitation for tasks where communicating an erroneous prediction carries a\nrisk. In this paper we tackle a real-world problem of forecasting impending\nfinancial expenses and incomings of customers, while displaying predictable\nmonetary amounts on a mobile app. In this context, we investigate if we would\nobtain an advantage by applying Deep Learning models with a Heteroscedastic\nmodel of the variance of a network's output. Experimentally, we achieve a\nhigher accuracy than non-trivial baselines. More importantly, we introduce a\nmechanism to discard low-confidence predictions, which means that they will not\nbe visible to users. This should help enhance the user experience of our\nproduct.\n","authors":"Axel Brando|Jose A. Rodríguez-Serrano|Mauricio Ciprian|Roberto Maestre|Jordi Vitrià","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.09011v1","link_pdf":"http://arxiv.org/pdf/1807.09011v1","link_doi":"","comment":"17 pages, 5 figures, Applied Data Science Track of The European\n Conference on Machine Learning and Principles and Practice of Knowledge\n Discovery in Databases (ECML-PKDD 2018)","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1807.09247v3","submitted":"2018-07-24 17:34:08","updated":"2018-10-02 02:39:09","title":"Structural biology meets data science: Does anything change?","abstract":" Data science has emerged from the proliferation of digital data, coupled with\nadvances in algorithms, software and hardware (e.g., GPU computing).\nInnovations in structural biology have been driven by similar factors, spurring\nus to ask: can these two fields impact one another in deep and hitherto\nunforeseen ways? We posit that the answer is yes. New biological knowledge lies\nin the relationships between sequence, structure, function and disease, all of\nwhich play out on the stage of evolution, and data science enables us to\nelucidate these relationships at scale. Here, we consider the above question\nfrom the five key pillars of data science: acquisition, engineering, analytics,\nvisualization and policy, with an emphasis on machine learning as the premier\nanalytics approach.\n","authors":"Cameron Mura|Eli J. Draizen|Philip E. 
Bourne","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.09247v3","link_pdf":"http://arxiv.org/pdf/1807.09247v3","link_doi":"http://dx.doi.org/10.1016/j.sbi.2018.09.003","comment":"20 pages total, 2 figures, 1 item of supplementary material","journal_ref":"vol 52, 2018, pp 95-102","doi":"10.1016/j.sbi.2018.09.003","primary_category":"q-bio.QM","categories":"q-bio.QM"} {"id":"1807.11887v3","submitted":"2018-07-31 15:59:29","updated":"2019-01-08 20:23:53","title":"Gaussian Process Landmarking for Three-Dimensional Geometric\n Morphometrics","abstract":" We demonstrate applications of the Gaussian process-based landmarking\nalgorithm proposed in [T. Gao, S.Z. Kovalsky, and I. Daubechies, SIAM Journal\non Mathematics of Data Science (2019)] to geometric morphometrics, a branch of\nevolutionary biology centered at the analysis and comparisons of anatomical\nshapes, and compares the automatically sampled landmarks with the \"ground\ntruth\" landmarks manually placed by evolutionary anthropologists; the results\nsuggest that Gaussian process landmarks perform equally well or better, in\nterms of both spatial coverage and downstream statistical analysis. We provide\na detailed exposition of numerical procedures and feature filtering algorithms\nfor computing high-quality and semantically meaningful diffeomorphisms between\ndisk-type anatomical surfaces.\n","authors":"Tingran Gao|Shahar Z. Kovalsky|Doug M. Boyer|Ingrid Daubechies","affiliations":"","link_abstract":"http://arxiv.org/abs/1807.11887v3","link_pdf":"http://arxiv.org/pdf/1807.11887v3","link_doi":"","comment":"41 pages, 17 figures, 3 tables. Some portions of this work appeared\n earlier as arXiv:1802.03479, which was split into 2 parts during the\n refereeing process. This version combines the main text with the supplemental\n materials. Figure sizes have been reduced to meet arxiv size limit","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP|60G15, 62K05, 65D18|I.3.5; I.2.6"} {"id":"1808.00156v2","submitted":"2018-07-31 17:16:52","updated":"2019-11-28 12:05:18","title":"Smart Grids Data Analysis: A Systematic Mapping Study","abstract":" Data analytics and data science play a significant role in nowadays society.\nIn the context of Smart Grids (SG), the collection of vast amounts of data has\nseen the emergence of a plethora of data analysis approaches. In this paper, we\nconduct a Systematic Mapping Study (SMS) aimed at getting insights about\ndifferent facets of SG data analysis: application sub-domains (e.g., power load\ncontrol), aspects covered (e.g., forecasting), used techniques (e.g.,\nclustering), tool-support, research methods (e.g., experiments/simulations),\nreplicability/reproducibility of research. The final goal is to provide a view\nof the current status of research. Overall, we found that each sub-domain has\nits peculiarities in terms of techniques, approaches and research methodologies\napplied. Simulations and experiments play a crucial role in many areas. 
The\nreplicability of studies is limited concerning the provided implemented\nalgorithms, and to a lower extent due to the usage of private datasets.\n","authors":"Bruno Rossi|Stanislav Chren","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.00156v2","link_pdf":"http://arxiv.org/pdf/1808.00156v2","link_doi":"http://dx.doi.org/10.1109/TII.2019.2954098","comment":"","journal_ref":"","doi":"10.1109/TII.2019.2954098","primary_category":"cs.OH","categories":"cs.OH"} {"id":"1808.00197v1","submitted":"2018-08-01 07:07:15","updated":"2018-08-01 07:07:15","title":"MaxMin Linear Initialization for Fuzzy C-Means","abstract":" Clustering is an extensive research area in data science. The aim of\nclustering is to discover groups and to identify interesting patterns in\ndatasets. Crisp (hard) clustering considers that each data point belongs to one\nand only one cluster. However, it is inadequate as some data points may belong\nto several clusters, as is the case in text categorization. Thus, we need more\nflexible clustering. Fuzzy clustering methods, where each data point can belong\nto several clusters, are an interesting alternative. Yet, seeding iterative\nfuzzy algorithms to achieve high quality clustering is an issue. In this paper,\nwe propose a new linear and efficient initialization algorithm MaxMin Linear to\ndeal with this problem. Then, we validate our theoretical results through\nextensive experiments on a variety of numerical real-world and artificial\ndatasets. We also test several validity indices, including a new validity index\nthat we propose, Transformed Standardized Fuzzy Difference (TSFD).\n","authors":"Aybükë Oztürk|Stéphane Lallich|Jérôme Darmont|Sylvie Yona Waksman","affiliations":"ERIC, ArAr|ERIC|ERIC|ArAr","link_abstract":"http://arxiv.org/abs/1808.00197v1","link_pdf":"http://arxiv.org/pdf/1808.00197v1","link_doi":"","comment":"","journal_ref":"IBaI. 14th International Conference on Machine Learning and Data\n Mining (MLDM 2018), Jul 2018, New York, United States. Springer, Lecture\n Notes in Artificial Intelligence, 10934-10935, 2018, Machine Learning and\n Data Mining in Pattern Recognition. http://www.mldm.de","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.DB|stat.ML"} {"id":"1808.00616v1","submitted":"2018-08-02 01:09:24","updated":"2018-08-02 01:09:24","title":"Mixture Matrix Completion","abstract":" Completing a data matrix X has become an ubiquitous problem in modern data\nscience, with applications in recommender systems, computer vision, and\nnetworks inference, to name a few. One typical assumption is that X is\nlow-rank. A more general model assumes that each column of X corresponds to one\nof several low-rank matrices. This paper generalizes these models to what we\ncall mixture matrix completion (MMC): the case where each entry of X\ncorresponds to one of several low-rank matrices. MMC is a more accurate model\nfor recommender systems, and brings more flexibility to other completion and\nclustering problems. We make four fundamental contributions about this new\nmodel. First, we show that MMC is theoretically possible (well-posed). Second,\nwe give its precise information-theoretic identifiability conditions. Third, we\nderive the sample complexity of MMC. Finally, we give a practical algorithm for\nMMC with performance comparable to the state-of-the-art for simpler related\nproblems, both on synthetic and real data.\n","authors":"Daniel L. 
Pimentel-Alarcón","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.00616v1","link_pdf":"http://arxiv.org/pdf/1808.00616v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1808.01091v1","submitted":"2018-08-03 05:51:46","updated":"2018-08-03 05:51:46","title":"DataDeps.jl: Repeatable Data Setup for Replicable Data Science","abstract":" We present DataDeps.jl: a julia package for the reproducible handling of\nstatic datasets to enhance the repeatability of scripts used in the data and\ncomputational sciences. It is used to automate the data setup part of running\nsoftware which accompanies a paper to replicate a result. This step is commonly\ndone manually, which expends time and allows for confusion. This functionality\nis also useful for other packages which require data to function (e.g. a\ntrained machine learning based model). DataDeps.jl simplifies extending\nresearch software by automatically managing the dependencies and makes it\neasier to run another author's code, thus enhancing the reproducibility of data\nscience research.\n","authors":"Lyndon White|Roberto Togneri|Wei Liu|Mohammed Bennamoun","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.01091v1","link_pdf":"http://arxiv.org/pdf/1808.01091v1","link_doi":"","comment":"Source code: https://github.com/oxinabox/DataDeps.jl/","journal_ref":"","doi":"","primary_category":"cs.SE","categories":"cs.SE"} {"id":"1808.01145v1","submitted":"2018-08-03 10:24:32","updated":"2018-08-03 10:24:32","title":"Hoeffding Trees with nmin adaptation","abstract":" Machine learning software accounts for a significant amount of energy\nconsumed in data centers. These algorithms are usually optimized towards\npredictive performance, i.e. accuracy, and scalability. This is the case of\ndata stream mining algorithms. Although these algorithms are adaptive to the\nincoming data, they have fixed parameters from the beginning of the execution.\nWe have observed that having fixed parameters lead to unnecessary computations,\nthus making the algorithm energy inefficient. In this paper we present the nmin\nadaptation method for Hoeffding trees. This method adapts the value of the nmin\nparameter, which significantly affects the energy consumption of the algorithm.\nThe method reduces unnecessary computations and memory accesses, thus reducing\nthe energy, while the accuracy is only marginally affected. 
We experimentally\ncompared VFDT (Very Fast Decision Tree, the first Hoeffding tree algorithm) and\nCVFDT (Concept-adapting VFDT) with the VFDT-nmin (VFDT with nmin adaptation).\nThe results show that VFDT-nmin consumes up to 27% less energy than the\nstandard VFDT, and up to 92% less energy than CVFDT, trading off a few percent\nof accuracy in a few datasets.\n","authors":"Eva García-Martín|Niklas Lavesson|Håkan Grahn|Emiliano Casalicchio|Veselka Boeva","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.01145v1","link_pdf":"http://arxiv.org/pdf/1808.01145v1","link_doi":"","comment":"Accepted at: The 5th IEEE International Conference on Data Science\n and Advanced Analytics (DSAA 2018)","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1808.01175v1","submitted":"2018-08-03 12:57:15","updated":"2018-08-03 12:57:15","title":"Content-driven, unsupervised clustering of news articles through\n multiscale graph partitioning","abstract":" The explosion in the amount of news and journalistic content being generated\nacross the globe, coupled with extended and instantaneous access to information\nthrough online media, makes it difficult and time-consuming to monitor news\ndevelopments and opinion formation in real time. There is an increasing need\nfor tools that can pre-process, analyse and classify raw text to extract\ninterpretable content; specifically, identifying topics and content-driven\ngroupings of articles. We present here such a methodology that brings together\npowerful vector embeddings from Natural Language Processing with tools from\nGraph Theory that exploit diffusive dynamics on graphs to reveal natural\npartitions across scales. Our framework uses a recent deep neural network text\nanalysis methodology (Doc2vec) to represent text in vector form and then\napplies a multi-scale community detection method (Markov Stability) to\npartition a similarity graph of document vectors. The method allows us to\nobtain clusters of documents with similar content, at different levels of\nresolution, in an unsupervised manner. We showcase our approach with the\nanalysis of a corpus of 9,000 news articles published by Vox Media over one\nyear. Our results show consistent groupings of documents according to content\nwithout a priori assumptions about the number or type of clusters to be found.\nThe multilevel clustering reveals a quasi-hierarchy of topics and subtopics\nwith increased intelligibility and improved topic coherence as compared to\nexternal taxonomy services and standard topic detection methods.\n","authors":"M. Tarik Altuncu|Sophia N. Yaliraki|Mauricio Barahona","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.01175v1","link_pdf":"http://arxiv.org/pdf/1808.01175v1","link_doi":"","comment":"8 pages; 5 figures; To present at KDD 2018: Data Science, Journalism\n & Media workshop","journal_ref":"","doi":"","primary_category":"cs.CL","categories":"cs.CL|cs.IR|cs.LG|math.SP"} {"id":"1808.04440v1","submitted":"2018-08-06 09:36:08","updated":"2018-08-06 09:36:08","title":"FaceOff: Anonymizing Videos in the Operating Rooms","abstract":" Video capture in the surgical operating room (OR) is increasingly possible\nand has potential for use with computer assisted interventions (CAI), surgical\ndata science and within smart OR integration. Captured video innately carries\nsensitive information that should not be completely visible in order to\npreserve the patient's and the clinical teams' identities. 
When surgical video\nstreams are stored on a server, the videos must be anonymized prior to storage\nif taken outside of the hospital. In this article, we describe how a deep\nlearning model, Faster R-CNN, can be used for this purpose and help to\nanonymize video data captured in the OR. The model detects and blurs faces in\nan effort to preserve anonymity. After testing an existing face detection\ntrained model, a new dataset tailored to the surgical environment, with faces\nobstructed by surgical masks and caps, was collected for fine-tuning to achieve\nhigher face-detection rates in the OR. We also propose a temporal\nregularisation kernel to improve recall rates. The fine-tuned model achieves a\nface detection recall of 88.05 % and 93.45 % before and after applying\ntemporal-smoothing respectively.\n","authors":"Evangello Flouty|Odysseas Zisimopoulos|Danail Stoyanov","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.04440v1","link_pdf":"http://arxiv.org/pdf/1808.04440v1","link_doi":"","comment":"MICCAI 2018: OR 2.0 Context-Aware Operating Theaters","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|cs.AI"} {"id":"1808.02061v4","submitted":"2018-08-06 18:36:12","updated":"2019-09-18 20:40:29","title":"Semblance: A Rank-Based Kernel on Probability Spaces for Niche Detection","abstract":" In data science, determining proximity between observations is critical to\nmany downstream analyses such as clustering, information retrieval and\nclassification. However, when the underlying structure of the data probability\nspace is unclear, the function used to compute similarity between data points\nis often arbitrarily chosen. Here, we present a novel concept of proximity,\nSemblance, that uses the empirical distribution across all observations to\ninform the similarity between each pair. The advantage of Semblance lies in its\ndistribution free formulation and its ability to detect niche features by\nplacing greater emphasis on similarity between observation pairs that fall at\nthe outskirts of the data distribution, as opposed to those that fall towards\nthe center. We prove that Semblance is a valid Mercer kernel, thus allowing its\nprincipled use in kernel based learning machines. Semblance can be applied to\nany data modality, and we demonstrate its consistently improved performance\nagainst conventional methods through simulations and three real case studies\nfrom very different applications, viz. cell type classification using single\ncell RNA sequencing, selecting predictors of positive return on real estate\ninvestments, and image compression.\n","authors":"Divyansh Agarwal|Nancy R. Zhang","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.02061v4","link_pdf":"http://arxiv.org/pdf/1808.02061v4","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML"} {"id":"1808.02129v2","submitted":"2018-08-06 21:44:06","updated":"2018-08-29 23:31:44","title":"Probabilistic Causal Analysis of Social Influence","abstract":" Mastering the dynamics of social influence requires separating, in a database\nof information propagation traces, the genuine causal processes from temporal\ncorrelation, i.e., homophily and other spurious causes. 
However, most studies\nto characterize social influence, and, in general, most data-science analyses\nfocus on correlations, statistical independence, or conditional independence.\nOnly recently, there has been a resurgence of interest in \"causal data\nscience\", e.g., grounded on causality theories. In this paper we adopt a\nprincipled causal approach to the analysis of social influence from\ninformation-propagation data, rooted in the theory of probabilistic causation.\n Our approach consists of two phases. In the first one, in order to avoid the\npitfalls of misinterpreting causation when the data spans a mixture of several\nsubtypes (\"Simpson's paradox\"), we partition the set of propagation traces into\ngroups, in such a way that each group is as less contradictory as possible in\nterms of the hierarchical structure of information propagation. To achieve this\ngoal, we borrow the notion of \"agony\" and define the Agony-bounded Partitioning\nproblem, which we prove being hard, and for which we develop two efficient\nalgorithms with approximation guarantees. In the second phase, for each group\nfrom the first phase, we apply a constrained MLE approach to ultimately learn a\nminimal causal topology. Experiments on synthetic data show that our method is\nable to retrieve the genuine causal arcs w.r.t. a ground-truth generative\nmodel. Experiments on real data show that, by focusing only on the extracted\ncausal structures instead of the whole social graph, the effectiveness of\npredicting influence spread is significantly improved.\n","authors":"Francesco Bonchi|Francesco Gullo|Bud Mishra|Daniele Ramazzotti","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.02129v2","link_pdf":"http://arxiv.org/pdf/1808.02129v2","link_doi":"","comment":"","journal_ref":"CIKM 18, October 22-26, 2018, Torino, Italy","doi":"","primary_category":"cs.SI","categories":"cs.SI|cs.LG|physics.soc-ph|stat.ML"} {"id":"1808.03265v2","submitted":"2018-08-09 04:08:46","updated":"2019-03-18 18:52:59","title":"A Hybrid Recommender System for Patient-Doctor Matchmaking in Primary\n Care","abstract":" We partner with a leading European healthcare provider and design a mechanism\nto match patients with family doctors in primary care. We define the\nmatchmaking process for several distinct use cases given different levels of\navailable information about patients. Then, we adopt a hybrid recommender\nsystem to present each patient a list of family doctor recommendations. In\nparticular, we model patient trust of family doctors using a large-scale\ndataset of consultation histories, while accounting for the temporal dynamics\nof their relationships. Our proposed approach shows higher predictive accuracy\nthan both a heuristic baseline and a collaborative filtering approach, and the\nproposed trust measure further improves model performance.\n","authors":"Qiwei Han|Mengxin Ji|Inigo Martinez de Rituerto de Troya|Manas Gaur|Leid Zejnilovic","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.03265v2","link_pdf":"http://arxiv.org/pdf/1808.03265v2","link_doi":"","comment":"This paper is accepted at DSAA 2018 as a full paper, Proc. 
of the 5th\n IEEE International Conference on Data Science and Advanced Analytics (DSAA),\n Turin, Italy","journal_ref":"","doi":"","primary_category":"cs.IR","categories":"cs.IR|cs.LG|stat.ML"} {"id":"1808.04580v2","submitted":"2018-08-14 08:24:01","updated":"2018-12-12 12:07:04","title":"NFFT meets Krylov methods: Fast matrix-vector products for the graph\n Laplacian of fully connected networks","abstract":" The graph Laplacian is a standard tool in data science, machine learning, and\nimage processing. The corresponding matrix inherits the complex structure of\nthe underlying network and is in certain applications densely populated. This\nmakes computations, in particular matrix-vector products, with the graph\nLaplacian a hard task. A typical application is the computation of a number of\nits eigenvalues and eigenvectors. Standard methods become infeasible as the\nnumber of nodes in the graph is too large. We propose the use of the fast\nsummation based on the nonequispaced fast Fourier transform (NFFT) to perform\nthe dense matrix-vector product with the graph Laplacian fast without ever\nforming the whole matrix. The enormous flexibility of the NFFT algorithm allows\nus to embed the accelerated multiplication into Lanczos-based eigenvalues\nroutines or iterative linear system solvers and even consider other than the\nstandard Gaussian kernels. We illustrate the feasibility of our approach on a\nnumber of test problems from image segmentation to semi-supervised learning\nbased on graph-based PDEs. In particular, we compare our approach with the\nNystr\\\"om method. Moreover, we present and test an enhanced, hybrid version of\nthe Nystr\\\"om method, which internally uses the NFFT.\n","authors":"Dominik Alfke|Daniel Potts|Martin Stoll|Toni Volkmer","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.04580v2","link_pdf":"http://arxiv.org/pdf/1808.04580v2","link_doi":"http://dx.doi.org/10.3389/fams.2018.00061","comment":"28 pages, 9 figures","journal_ref":"","doi":"10.3389/fams.2018.00061","primary_category":"cs.LG","categories":"cs.LG|math.NA|stat.ML|68R10, 05C50, 65F15, 65T50, 68T05, 62H30"} {"id":"1808.04753v2","submitted":"2018-08-14 15:31:56","updated":"2019-10-15 23:20:16","title":"Estimating the size of a hidden finite set: large-sample behavior of\n estimators","abstract":" A finite set is \"hidden\" if its elements are not directly enumerable or if\nits size cannot be ascertained via a deterministic query. In public health,\nepidemiology, demography, ecology and intelligence analysis, researchers have\ndeveloped a wide variety of indirect statistical approaches, under different\nmodels for sampling and observation, for estimating the size of a hidden set.\nSome methods make use of random sampling with known or estimable sampling\nprobabilities, and others make structural assumptions about relationships (e.g.\nordering or network information) between the elements that comprise the hidden\nset. In this review, we describe models and methods for learning about the size\nof a hidden finite set, with special attention to asymptotic properties of\nestimators. We study the properties of these methods under two asymptotic\nregimes, \"infill\" in which the number of fixed-size samples increases, but the\npopulation size remains constant, and \"outfill\" in which the sample size and\npopulation size grow together. Statistical properties under these two regimes\ncan be dramatically different.\n","authors":"Si Cheng|Daniel J. Eck|Forrest W. 
Crawford","affiliations":"Department of Biostatistics, University of Washington|Department of Statistics, University of Illinois Urbana-Champaign|Department of Biostatistics, Yale School of Public Health","link_abstract":"http://arxiv.org/abs/1808.04753v2","link_pdf":"http://arxiv.org/pdf/1808.04753v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.AP|stat.ME|stat.TH"} {"id":"1808.04849v1","submitted":"2018-08-14 18:31:22","updated":"2018-08-14 18:31:22","title":"A Scalable Data Science Platform for Healthcare and Precision Medicine\n Research","abstract":" Objective: To (1) demonstrate the implementation of a data science platform\nbuilt on open-source technology within a large, academic healthcare system and\n(2) describe two computational healthcare applications built on such a\nplatform. Materials and Methods: A data science platform based on several open\nsource technologies was deployed to support real-time, big data workloads. Data\nacquisition workflows for Apache Storm and NiFi were developed in Java and\nPython to capture patient monitoring and laboratory data for downstream\nanalytics. Results: The use of emerging data management approaches along with\nopen-source technologies such as Hadoop can be used to create integrated data\nlakes to store large, real-time data sets. This infrastructure also provides a\nrobust analytics platform where healthcare and biomedical research data can be\nanalyzed in near real-time for precision medicine and computational healthcare\nuse cases. Discussion: The implementation and use of integrated data science\nplatforms offer organizations the opportunity to combine traditional data sets,\nincluding data from the electronic health record, with emerging big data\nsources, such as continuous patient monitoring and real-time laboratory\nresults. These platforms can enable cost-effective and scalable analytics for\nthe information that will be key to the delivery of precision medicine\ninitiatives. Conclusion: Organizations that can take advantage of the technical\nadvances found in data science platforms will have the opportunity to provide\ncomprehensive access to healthcare data for computational healthcare and\nprecision medicine research.\n","authors":"Jacob McPadden|Thomas JS Durant|Dustin R Bunch|Andreas Coppi|Nathan Price|Kris Rodgerson|Charles J Torre Jr|William Byron|H Patrick Young|Allen L Hsiao|Harlan M Krumholz|Wade L Schulz","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.04849v1","link_pdf":"http://arxiv.org/pdf/1808.04849v1","link_doi":"","comment":"8 pages, 4 figures, 1 table","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC"} {"id":"1808.05329v1","submitted":"2018-08-16 02:58:54","updated":"2018-08-16 02:58:54","title":"Sequential Behavioral Data Processing Using Deep Learning and the Markov\n Transition Field in Online Fraud Detection","abstract":" Due to the popularity of the Internet and smart mobile devices, more and more\nfinancial transactions and activities have been digitalized. Compared to\ntraditional financial fraud detection strategies using credit-related features,\ncustomers are generating a large amount of unstructured behavioral data every\nsecond. In this paper, we propose an Recurrent Neural Netword (RNN) based\ndeep-learning structure integrated with Markov Transition Field (MTF) for\npredicting online fraud behaviors using customer's interactions with websites\nor smart-phone apps as a series of states. 
In practice, we tested and proved\nthat the proposed network structure for processing sequential behavioral data\ncould significantly boost fraud predictive ability comparing with the\nmultilayer perceptron network and distance based classifier with Dynamic Time\nWarping(DTW) as distance metric.\n","authors":"Ruinan Zhang|Fanglan Zheng|Wei Min","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.05329v1","link_pdf":"http://arxiv.org/pdf/1808.05329v1","link_doi":"","comment":"KDD2018 Data Science in Fintech Workshop Paper","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.IR|stat.ML"} {"id":"1808.06492v1","submitted":"2018-08-17 02:15:39","updated":"2018-08-17 02:15:39","title":"Benchmarking Automatic Machine Learning Frameworks","abstract":" AutoML serves as the bridge between varying levels of expertise when\ndesigning machine learning systems and expedites the data science process. A\nwide range of techniques is taken to address this, however there does not exist\nan objective comparison of these techniques. We present a benchmark of current\nopen source AutoML solutions using open source datasets. We test auto-sklearn,\nTPOT, auto_ml, and H2O's AutoML solution against a compiled set of regression\nand classification datasets sourced from OpenML and find that auto-sklearn\nperforms the best across classification datasets and TPOT performs the best\nacross regression datasets.\n","authors":"Adithya Balaji|Alexander Allen","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.06492v1","link_pdf":"http://arxiv.org/pdf/1808.06492v1","link_doi":"","comment":"9 pages, 8 figures, 5 tables","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.AI|stat.ML"} {"id":"1808.06883v1","submitted":"2018-08-21 13:02:27","updated":"2018-08-21 13:02:27","title":"Population synthesis of binary stars","abstract":" Many aspects of the evolution of stars, and in particular the evolution of\nbinary stars, remain beyond our ability to model them in detail. Instead, we\nrely on observations to guide our often phenomenological models and pin down\nuncertain model parameters. To do this statistically requires population\nsynthesis. Populations of stars modelled on computers are compared to\npopulations of stars observed with our best telescopes. The closest match\nbetween observations and models provides insight into unknown model parameters\nand hence the underlying astrophysics. In this brief review, we describe the\nimpact that modern big-data surveys will have on population synthesis, the\nlarge parameter space problem that is rife for the application of modern data\nscience algorithms, and some examples of how population synthesis is relevant\nto modern astrophysics.\n","authors":"Robert G. Izzard|Ghina M. Halabi","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.06883v1","link_pdf":"http://arxiv.org/pdf/1808.06883v1","link_doi":"","comment":"Accepted for publication by Cambridge University Press as book\n chapter in 'The Impact of Binary Stars on Stellar Evolution' by Giacomo\n Beccari and Henri Boffin. 19 pages, 2 figures","journal_ref":"","doi":"","primary_category":"astro-ph.SR","categories":"astro-ph.SR"} {"id":"1808.07452v2","submitted":"2018-08-22 17:36:08","updated":"2019-01-22 00:23:23","title":"Generalized Canonical Polyadic Tensor Decomposition","abstract":" Tensor decomposition is a fundamental unsupervised machine learning method in\ndata science, with applications including network analysis and sensor data\nprocessing. 
This work develops a generalized canonical polyadic (GCP) low-rank\ntensor decomposition that allows other loss functions besides squared error.\nFor instance, we can use logistic loss or Kullback-Leibler divergence, enabling\ntensor decomposition for binary or count data. We present a variety\nstatistically-motivated loss functions for various scenarios. We provide a\ngeneralized framework for computing gradients and handling missing data that\nenables the use of standard optimization methods for fitting the model. We\ndemonstrate the flexibility of GCP on several real-world examples including\ninteractions in a social network, neural activity in a mouse, and monthly\nrainfall measurements in India.\n","authors":"David Hong|Tamara G. Kolda|Jed A. Duersch","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.07452v2","link_pdf":"http://arxiv.org/pdf/1808.07452v2","link_doi":"http://dx.doi.org/10.1137/18M1203626","comment":"","journal_ref":"SIAM Review, Vol. 62, No. 1, pp. 133-163, 2020","doi":"10.1137/18M1203626","primary_category":"math.NA","categories":"math.NA|cs.LG"} {"id":"1808.08765v2","submitted":"2018-08-27 10:04:36","updated":"2019-03-28 08:22:30","title":"Identifiability of Complete Dictionary Learning","abstract":" Sparse component analysis (SCA), also known as complete dictionary learning,\nis the following problem: Given an input matrix $M$ and an integer $r$, find a\ndictionary $D$ with $r$ columns and a matrix $B$ with $k$-sparse columns (that\nis, each column of $B$ has at most $k$ non-zero entries) such that $M \\approx\nDB$. A key issue in SCA is identifiability, that is, characterizing the\nconditions under which $D$ and $B$ are essentially unique (that is, they are\nunique up to permutation and scaling of the columns of $D$ and rows of $B$).\nAlthough SCA has been vastly investigated in the last two decades, only a few\nworks have tackled this issue in the deterministic scenario, and no work\nprovides reasonable bounds in the minimum number of samples (that is, columns\nof $M$) that leads to identifiability. In this work, we provide new results in\nthe deterministic scenario when the data has a low-rank structure, that is,\nwhen $D$ is (under)complete. While previous bounds feature a combinatorial term\n$r \\choose k$, we exhibit a sufficient condition involving\n$\\mathcal{O}(r^3/(r-k)^2)$ samples that yields an essentially unique\ndecomposition, as long as these data points are well spread among the subspaces\nspanned by $r-1$ columns of $D$. We also exhibit a necessary lower bound on the\nnumber of samples that contradicts previous results in the literature when $k$\nequals $r-1$. Our bounds provide a drastic improvement compared to the state of\nthe art, and imply for example that for a fixed proportion of zeros (constant\nand independent of $r$, e.g., 10\\% of zero entries in $B$), one only requires\n$\\mathcal{O}(r)$ data points to guarantee identifiability.\n","authors":"Jérémy E. Cohen|Nicolas Gillis","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.08765v2","link_pdf":"http://arxiv.org/pdf/1808.08765v2","link_doi":"http://dx.doi.org/10.1137/18M1233339","comment":"19 pages, 2 figures, new title, added references and discussions","journal_ref":"SIAM Journal on Mathematics of Data Science 1 (3), pp. 
518-536,\n 2019","doi":"10.1137/18M1233339","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1809.00984v1","submitted":"2018-08-29 06:53:06","updated":"2018-08-29 06:53:06","title":"The materials data ecosystem: materials data science and its role in\n data-driven materials discovery","abstract":" Since its launch in 2011, Materials Genome Initiative (MGI) has drawn the\nattention of researchers from across academia, government, and industry\nworldwide.As one of the three tools of MGI, the materials data, for the first\ntime, emerged as an extremely significant approach in materials discovery. Data\nscience has been applied in different disciplines as an interdisciplinary field\nto extract knowledge from the data. The concept of materials data science was\nutilized to demonstrate the data application in materials science. To explore\nits potential as an active research branch in the big data age, a three-tier\nsystem was put forward to define the infrastructure of data classification,\ncuration and knowledge extraction of materials data.\n","authors":"Hai-Qing Yin|Xue Jiang|Guo-Quan Liu|Sharon Elder|Bin Xu1|Qing-Jun Zheng|Xuan-Hui Qu","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.00984v1","link_pdf":"http://arxiv.org/pdf/1809.00984v1","link_doi":"http://dx.doi.org/10.1088/1674-1056/27/11/118101","comment":"","journal_ref":"","doi":"10.1088/1674-1056/27/11/118101","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1808.10653v2","submitted":"2018-08-31 09:55:06","updated":"2018-10-02 14:34:32","title":"An Empirical Analysis of the Role of Amplifiers, Downtoners, and\n Negations in Emotion Classification in Microblogs","abstract":" The effect of amplifiers, downtoners, and negations has been studied in\ngeneral and particularly in the context of sentiment analysis. However, there\nis only limited work which aims at transferring the results and methods to\ndiscrete classes of emotions, e. g., joy, anger, fear, sadness, surprise, and\ndisgust. For instance, it is not straight-forward to interpret which emotion\nthe phrase \"not happy\" expresses. With this paper, we aim at obtaining a better\nunderstanding of such modifiers in the context of emotion-bearing words and\ntheir impact on document-level emotion classification, namely, microposts on\nTwitter. We select an appropriate scope detection method for modifiers of\nemotion words, incorporate it in a document-level emotion classification model\nas additional bag of words and show that this approach improves the performance\nof emotion classification. In addition, we build a term weighting approach\nbased on the different modifiers into a lexical model for the analysis of the\nsemantics of modifiers and their impact on emotion meaning. We show that\namplifiers separate emotions expressed with an emotion- bearing word more\nclearly from other secondary connotations. Downtoners have the opposite effect.\nIn addition, we discuss the meaning of negations of emotion-bearing words. 
For\ninstance we show empirically that \"not happy\" is closer to sadness than to\nanger and that fear-expressing words in the scope of downtoners often express\nsurprise.\n","authors":"Florian Strohm|Roman Klinger","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.10653v2","link_pdf":"http://arxiv.org/pdf/1808.10653v2","link_doi":"http://dx.doi.org/10.1109/DSAA.2018.00087","comment":"Accepted for publication at The 5th IEEE International Conference on\n Data Science and Advanced Analytics (DSAA), https://dsaa2018.isi.it/","journal_ref":"","doi":"10.1109/DSAA.2018.00087","primary_category":"cs.CL","categories":"cs.CL"} {"id":"1808.10685v2","submitted":"2018-08-31 11:20:53","updated":"2019-04-24 11:18:07","title":"Extracting Keywords from Open-Ended Business Survey Questions","abstract":" Open-ended survey data constitute an important basis in research as well as\nfor making business decisions. Collecting and manually analysing free-text\nsurvey data is generally more costly than collecting and analysing survey data\nconsisting of answers to multiple-choice questions. Yet free-text data allow\nfor new content to be expressed beyond predefined categories and are a very\nvaluable source of new insights into people's opinions. At the same time,\nsurveys always make ontological assumptions about the nature of the entities\nthat are researched, and this has vital ethical consequences. Human\ninterpretations and opinions can only be properly ascertained in their richness\nusing textual data sources; if these sources are analyzed appropriately, the\nessential linguistic nature of humans and social entities is safeguarded.\nNatural Language Processing (NLP) offers possibilities for meeting this ethical\nbusiness challenge by automating the analysis of natural language and thus\nallowing for insightful investigations of human judgements. We present a\ncomputational pipeline for analysing large amounts of responses to open-ended\nquestions in surveys and extract keywords that appropriately represent people's\nopinions. This pipeline addresses the need to perform such tasks outside the\nscope of both commercial software and bespoke analysis, exceeds the performance\nto state-of-the-art systems, and performs this task in a transparent way that\nallows for scrutinising and exposing potential biases in the analysis.\nFollowing the principle of Open Data Science, our code is open-source and\ngeneralizable to other datasets.\n","authors":"Barbara McGillivray|Gard Jenset|Dominik Heil","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.10685v2","link_pdf":"http://arxiv.org/pdf/1808.10685v2","link_doi":"","comment":"1 figure","journal_ref":"Journal of Data Mining & Digital Humanities, 2020, Project (March\n 17, 2020) jdmdh:5398","doi":"","primary_category":"cs.CL","categories":"cs.CL"} {"id":"1808.10862v1","submitted":"2018-08-31 17:43:21","updated":"2018-08-31 17:43:21","title":"Open Source Dataset and Machine Learning Techniques for Automatic\n Recognition of Historical Graffiti","abstract":" Machine learning techniques are presented for automatic recognition of the\nhistorical letters (XI-XVIII centuries) carved on the stoned walls of St.Sophia\ncathedral in Kyiv (Ukraine). A new image dataset of these carved Glagolitic and\nCyrillic letters (CGCL) was assembled and pre-processed for recognition and\nprediction by machine learning methods. The dataset consists of more than 4000\nimages for 34 types of letters. 
The explanatory data analysis of CGCL and\nnotMNIST datasets shown that the carved letters can hardly be differentiated by\ndimensionality reduction methods, for example, by t-distributed stochastic\nneighbor embedding (tSNE) due to the worse letter representation by stone\ncarving in comparison to hand writing. The multinomial logistic regression\n(MLR) and a 2D convolutional neural network (CNN) models were applied. The MLR\nmodel demonstrated the area under curve (AUC) values for receiver operating\ncharacteristic (ROC) are not lower than 0.92 and 0.60 for notMNIST and CGCL,\nrespectively. The CNN model gave AUC values close to 0.99 for both notMNIST and\nCGCL (despite the much smaller size and quality of CGCL in comparison to\nnotMNIST) under condition of the high lossy data augmentation. CGCL dataset was\npublished to be available for the data science community as an open source\nresource.\n","authors":"Nikita Gordienko|Peng Gang|Yuri Gordienko|Wei Zeng|Oleg Alienin|Oleksandr Rokovyi|Sergii Stirenko","affiliations":"","link_abstract":"http://arxiv.org/abs/1808.10862v1","link_pdf":"http://arxiv.org/pdf/1808.10862v1","link_doi":"http://dx.doi.org/10.1007/978-3-030-04221-9_37","comment":"11 pages, 9 figures, accepted for 25th International Conference on\n Neural Information Processing (ICONIP 2018), 14-16 December, 2018 (Siem Reap,\n Cambodia)","journal_ref":"In: Cheng L., Leung A., Ozawa S. (eds) Neural Information\n Processing. ICONIP 2018. Lecture Notes in Computer Science, vol. 11305, pp.\n 414-424. Springer, Cham","doi":"10.1007/978-3-030-04221-9_37","primary_category":"cs.LG","categories":"cs.LG|cs.CV|cs.CY|stat.ML"} {"id":"1809.01062v1","submitted":"2018-09-04 16:09:15","updated":"2018-09-04 16:09:15","title":"JobComposer: Career Path Optimization via Multicriteria Utility Learning","abstract":" With online professional network platforms (OPNs, e.g., LinkedIn, Xing, etc.)\nbecoming popular on the web, people are now turning to these platforms to\ncreate and share their professional profiles, to connect with others who share\nsimilar professional aspirations and to explore new career opportunities. These\nplatforms however do not offer a long-term roadmap to guide career progression\nand improve workforce employability. The career trajectories of OPN users can\nserve as a reference but they are not always optimal. A career plan can also be\ndevised through consultation with career coaches, whose knowledge may however\nbe limited to a few industries. To address the above limitations, we present a\nnovel data-driven approach dubbed JobComposer to automate career path planning\nand optimization. Its key premise is that the observed career trajectories in\nOPNs may not necessarily be optimal, and can be improved by learning to\nmaximize the sum of payoffs attainable by following a career path. At its\nheart, JobComposer features a decomposition-based multicriteria utility\nlearning procedure to achieve the best tradeoff among different payoff criteria\nin career path planning. Extensive studies using a city state-based OPN dataset\ndemonstrate that JobComposer returns career paths better than other baseline\nmethods and the actual career paths.\n","authors":"Richard J. 
Oentaryo|Xavier Jayaraj Siddarth Ashok|Ee-Peng Lim|Philips Kokoh Prasetyo","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.01062v1","link_pdf":"http://arxiv.org/pdf/1809.01062v1","link_doi":"","comment":"","journal_ref":"ECML-PKDD Data Science for Human Capital Management 2018","doi":"","primary_category":"cs.SI","categories":"cs.SI|cs.LG"} {"id":"1809.02269v3","submitted":"2018-09-07 01:26:15","updated":"2019-05-28 02:28:39","title":"edge2vec: Representation learning using edge semantics for biomedical\n knowledge discovery","abstract":" Representation learning provides new and powerful graph analytical approaches\nand tools for the highly valued data science challenge of mining knowledge\ngraphs. Since previous graph analytical methods have mostly focused on\nhomogeneous graphs, an important current challenge is extending this\nmethodology for richly heterogeneous graphs and knowledge domains. The\nbiomedical sciences are such a domain, reflecting the complexity of biology,\nwith entities such as genes, proteins, drugs, diseases, and phenotypes, and\nrelationships such as gene co-expression, biochemical regulation, and\nbiomolecular inhibition or activation. Therefore, the semantics of edges and\nnodes are critical for representation learning and knowledge discovery in real\nworld biomedical problems. In this paper, we propose the edge2vec model, which\nrepresents graphs considering edge semantics. An edge-type transition matrix is\ntrained by an Expectation-Maximization approach, and a stochastic gradient\ndescent model is employed to learn node embedding on a heterogeneous graph via\nthe trained transition matrix. edge2vec is validated on three biomedical domain\ntasks: biomedical entity classification, compound-gene bioactivity prediction,\nand biomedical information retrieval. Results show that by considering\nedge-types into node embedding learning in heterogeneous graphs,\n\\textbf{edge2vec}\\ significantly outperforms state-of-the-art models on all\nthree tasks. We propose this method for its added value relative to existing\ngraph analytical methodology, and in the real world context of biomedical\nknowledge discovery applicability.\n","authors":"Zheng Gao|Gang Fu|Chunping Ouyang|Satoshi Tsutsui|Xiaozhong Liu|Jeremy Yang|Christopher Gessner|Brian Foote|David Wild|Qi Yu|Ying Ding","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.02269v3","link_pdf":"http://arxiv.org/pdf/1809.02269v3","link_doi":"","comment":"10 pages","journal_ref":"","doi":"","primary_category":"cs.IR","categories":"cs.IR"} {"id":"1809.02408v2","submitted":"2018-09-07 11:26:51","updated":"2019-03-05 04:38:35","title":"A Primer on Causality in Data Science","abstract":" Many questions in Data Science are fundamentally causal in that our objective\nis to learn the effect of some exposure, randomized or not, on an outcome\ninterest. Even studies that are seemingly non-causal, such as those with the\ngoal of prediction or prevalence estimation, have causal elements, including\ndifferential censoring or measurement. As a result, we, as Data Scientists,\nneed to consider the underlying causal mechanisms that gave rise to the data,\nrather than simply the pattern or association observed in those data. In this\nwork, we review the 'Causal Roadmap' of Petersen and van der Laan (2014) to\nprovide an introduction to some key concepts in causal inference. 
Similar to\nother causal frameworks, the steps of the Roadmap include clearly stating the\nscientific question, defining of the causal model, translating the scientific\nquestion into a causal parameter, assessing the assumptions needed to express\nthe causal parameter as a statistical estimand, implementation of statistical\nestimators including parametric and semi-parametric methods, and interpretation\nof our findings. We believe that using such a framework in Data Science will\nhelp to ensure that our statistical analyses are guided by the scientific\nquestion driving our research, while avoiding over-interpreting our results. We\nfocus on the effect of an exposure occurring at a single time point and\nhighlight the use of targeted maximum likelihood estimation (TMLE) with Super\nLearner.\n","authors":"Hachem Saddiki|Laura B. Balzer","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.02408v2","link_pdf":"http://arxiv.org/pdf/1809.02408v2","link_doi":"","comment":"26 pages (with references); 4 figures","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP|stat.ME|stat.ML"} {"id":"1809.03539v1","submitted":"2018-09-10 18:34:22","updated":"2018-09-10 18:34:22","title":"Annotating shadows, highlights and faces: the contribution of a 'human\n in the loop' for digital art history","abstract":" While automatic computational techniques appear to reveal novel insights in\ndigital art history, a complementary approach seems to get less attention: that\nof human annotation. We argue and exemplify that a 'human in the loop' can\nreveal insights that may be difficult to detect automatically. Specifically, we\nfocussed on perceptual aspects within pictorial art. Using rather simple\nannotation tasks (e.g. delineate human lengths, indicate highlights and\nclassify gaze direction) we could both replicate earlier findings and reveal\nnovel insights into pictorial conventions. We found that Canaletto depicted\nhuman figures in rather accurate perspective, varied viewpoint elevation\nbetween approximately 3 and 9 meters and highly preferred light directions\nparallel to the projection plane. Furthermore, we found that taking the\naveraged images of leftward looking faces reveals a woman, and for rightward\nlooking faces showed a male, confirming earlier accounts on lateral gender bias\nin pictorial art. Lastly, we confirmed and refined the well-known\nlight-from-the-left bias. Together, the annotations, analyses and results\nexemplify how human annotation can contribute and complement to technical and\ndigital art history.\n","authors":"Maarten W. A. Wijntjes","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.03539v1","link_pdf":"http://arxiv.org/pdf/1809.03539v1","link_doi":"","comment":"Presented at the \"1st KDD Workshop on Data Science for Digital Art\n History: tackling big data Challenges, Algorithms, and Systems\", see\n http://dsdah2018.blogs.dsv.su.se for more info. 
Manuscript should eventually\n be published in Journal of Digital Art History (www.dah-journal.org/)","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|cs.HC"} {"id":"1809.04399v1","submitted":"2018-09-12 13:12:07","updated":"2018-09-12 13:12:07","title":"Artificial Intelligence for the Public Sector: Opportunities and\n challenges of cross-sector collaboration","abstract":" Public sector organisations are increasingly interested in using data science\nand artificial intelligence capabilities to deliver policy and generate\nefficiencies in high uncertainty environments. The long-term success of data\nscience and AI in the public sector relies on effectively embedding it into\ndelivery solutions for policy implementation. However, governments cannot do\nthis integration of AI into public service delivery on their own. The UK\nGovernment Industrial Strategy is clear that delivering on the AI grand\nchallenge requires collaboration between universities and public and private\nsectors. This cross-sectoral collaborative approach is the norm in applied AI\ncentres of excellence around the world. Despite their popularity, cross-sector\ncollaborations entail serious management challenges that hinder their success.\nIn this article we discuss the opportunities and challenges from AI for public\nsector. Finally, we propose a series of strategies to successfully manage these\ncross-sectoral collaborations.\n","authors":"Slava Jankin Mikhaylov|Marc Esteve|Averill Campion","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.04399v1","link_pdf":"http://arxiv.org/pdf/1809.04399v1","link_doi":"http://dx.doi.org/10.1098/rsta.2017.0357","comment":"","journal_ref":"Philosophical Transactions of the Royal Society A, 2018, Volume\n 376, Issue 2128","doi":"10.1098/rsta.2017.0357","primary_category":"cs.AI","categories":"cs.AI|cs.CY"} {"id":"1809.04559v3","submitted":"2018-09-12 16:51:18","updated":"2019-01-17 12:40:35","title":"Benchmarking and Optimization of Gradient Boosting Decision Tree\n Algorithms","abstract":" Gradient boosting decision trees (GBDTs) have seen widespread adoption in\nacademia, industry and competitive data science due to their state-of-the-art\nperformance in many machine learning tasks. One relative downside to these\nmodels is the large number of hyper-parameters that they expose to the\nend-user. To maximize the predictive power of GBDT models, one must either\nmanually tune the hyper-parameters, or utilize automated techniques such as\nthose based on Bayesian optimization. Both of these approaches are\ntime-consuming since they involve repeatably training the model for different\nsets of hyper-parameters. A number of software GBDT packages have started to\noffer GPU acceleration which can help to alleviate this problem. In this paper,\nwe consider three such packages: XGBoost, LightGBM and Catboost. 
Firstly, we\nevaluate the performance of the GPU acceleration provided by these packages\nusing large-scale datasets with varying shapes, sparsities and learning tasks.\nThen, we compare the packages in the context of hyper-parameter optimization,\nboth in terms of how quickly each package converges to a good validation score,\nand in terms of generalization performance.\n","authors":"Andreea Anghel|Nikolaos Papandreou|Thomas Parnell|Alessandro De Palma|Haralampos Pozidis","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.04559v3","link_pdf":"http://arxiv.org/pdf/1809.04559v3","link_doi":"","comment":"Workshop on Systems for ML and Open Source Software at NeurIPS 2018,\n Montreal, Canada","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1809.05450v1","submitted":"2018-09-14 14:55:13","updated":"2018-09-14 14:55:13","title":"User preferences in Bayesian multi-objective optimization: the expected\n weighted hypervolume improvement criterion","abstract":" In this article, we present a framework for taking into account user\npreferences in multi-objective Bayesian optimization in the case where the\nobjectives are expensive-to-evaluate black-box functions. A novel expected\nimprovement criterion to be used within Bayesian optimization algorithms is\nintroduced. This criterion, which we call the expected weighted hypervolume\nimprovement (EWHI) criterion, is a generalization of the popular expected\nhypervolume improvement to the case where the hypervolume of the dominated\nregion is defined using an absolutely continuous measure instead of the\nLebesgue measure. The EWHI criterion takes the form of an integral for which no\nclosed form expression exists in the general case. To deal with its\ncomputation, we propose an importance sampling approximation method. A sampling\ndensity that is optimal for the computation of the EWHI for a predefined set of\npoints is crafted and a sequential Monte-Carlo (SMC) approach is used to obtain\na sample approximately distributed from this density. The ability of the\ncriterion to produce optimization strategies oriented by user preferences is\ndemonstrated on a simple bi-objective test problem in the cases of a preference\nfor one objective and of a preference for certain regions of the Pareto front.\n","authors":"Paul Feliot|Julien Bect|Emmanuel Vazquez","affiliations":"L2S, GdR MASCOT-NUM|L2S, GdR MASCOT-NUM|L2S, GdR MASCOT-NUM","link_abstract":"http://arxiv.org/abs/1809.05450v1","link_pdf":"http://arxiv.org/pdf/1809.05450v1","link_doi":"","comment":"To be published in the proceedings of LOD 2018 -- The Fourth\n International Conference on Machine Learning, Optimization, and Data Science\n -- September 13-16, 2018 -- Volterra, Tuscany, Italy","journal_ref":"","doi":"","primary_category":"math.OC","categories":"math.OC|math.ST|stat.TH"} {"id":"1809.05596v1","submitted":"2018-09-14 21:28:21","updated":"2018-09-14 21:28:21","title":"The Generic Holdout: Preventing False-Discoveries in Adaptive Data\n Science","abstract":" Adaptive data analysis has posed a challenge to science due to its ability to\ngenerate false hypotheses on moderately large data sets. In general, with\nnon-adaptive data analyses (where queries to the data are generated without\nbeing influenced by answers to previous queries) a data set containing $n$\nsamples may support exponentially many queries in $n$. 
This number reduces to\nlinearly many under naive adaptive data analysis, and even sophisticated\nremedies such as the Reusable Holdout (Dwork et. al 2015) only allow\nquadratically many queries in $n$.\n In this work, we propose a new framework for adaptive science which\nexponentially improves on this number of queries under a restricted yet\nscientifically relevant setting, where the goal of the scientist is to find a\nsingle (or a few) true hypotheses about the universe based on the samples. Such\na setting may describe the search for predictive factors of some disease based\non medical data, where the analyst may wish to try a number of predictive\nmodels until a satisfactory one is found.\n Our solution, the Generic Holdout, involves two simple ingredients: (1) a\npartitioning of the data into a exploration set and a holdout set and (2) a\nlimited exposure strategy for the holdout set. An analyst is free to use the\nexploration set arbitrarily, but when testing hypotheses against the holdout\nset, the analyst only learns the answer to the question: \"Is the given\nhypothesis true (empirically) on the holdout set?\" -- and no more information,\nsuch as \"how well\" the hypothesis fits the holdout set. The resulting scheme is\nimmediate to analyze, but despite its simplicity we do not believe our method\nis obvious, as evidenced by the many violations in practice.\n Our proposal can be seen as an alternative to pre-registration, and allows\nresearchers to get the benefits of adaptive data analysis without the problems\nof adaptivity.\n","authors":"Preetum Nakkiran|Jarosław Błasiok","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.05596v1","link_pdf":"http://arxiv.org/pdf/1809.05596v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ME","categories":"stat.ME|cs.LG|math.ST|stat.TH"} {"id":"1809.07839v1","submitted":"2018-09-20 20:24:35","updated":"2018-09-20 20:24:35","title":"Weak nodes detection in urban transport systems: Planning for resilience\n in Singapore","abstract":" The availability of massive data-sets describing human mobility offers the\npossibility to design simulation tools to monitor and improve the resilience of\ntransport systems in response to traumatic events such as natural and man-made\ndisasters (e.g. floods terroristic attacks, etc...). In this perspective, we\npropose ACHILLES, an application to model people's movements in a given\ntransport system mode through a multiplex network representation based on\nmobility data. ACHILLES is a web-based application which provides an\neasy-to-use interface to explore the mobility fluxes and the connectivity of\nevery urban zone in a city, as well as to visualize changes in the transport\nsystem resulting from the addition or removal of transport modes, urban zones,\nand single stops. Notably, our application allows the user to assess the\noverall resilience of the transport network by identifying its weakest node,\ni.e. Urban Achilles Heel, with reference to the ancient Greek mythology. 
To\ndemonstrate the impact of ACHILLES for humanitarian aid we consider its\napplication to a real-world scenario by exploring human mobility in Singapore\nin response to flood prevention.\n","authors":"Michele Ferretti|Gianni Barlacchi|Luca Pappalardo|Lorenzo Lucchini|Bruno Lepri","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.07839v1","link_pdf":"http://arxiv.org/pdf/1809.07839v1","link_doi":"","comment":"9 pages, 6 figures, IEEE Data Science and Advanced Analytics","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI|physics.soc-ph"} {"id":"1809.08935v1","submitted":"2018-09-21 12:04:44","updated":"2018-09-21 12:04:44","title":"Lexical Bias In Essay Level Prediction","abstract":" Automatically predicting the level of non-native English speakers given their\nwritten essays is an interesting machine learning problem. In this work I\npresent the system \"balikasg\" that achieved the state-of-the-art performance in\nthe CAp 2018 data science challenge among 14 systems. I detail the feature\nextraction, feature engineering and model selection steps and I evaluate how\nthese decisions impact the system's performance. The paper concludes with\nremarks for future work.\n","authors":"Georgios Balikas","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.08935v1","link_pdf":"http://arxiv.org/pdf/1809.08935v1","link_doi":"","comment":"CAp 2018","journal_ref":"","doi":"","primary_category":"cs.CL","categories":"cs.CL|cs.AI"} {"id":"1809.08614v2","submitted":"2018-09-23 15:25:31","updated":"2019-03-15 05:47:06","title":"Dynamical evolution of anti-social phenomena: A data science approach","abstract":" Human interactions can be either positive or negative, giving rise to\ndifferent complex social or anti-social phenomena. The dynamics of these\ninteractions often lead to certain spatio-temporal patterns and complex\nnetworks, which can be interesting to a wide range of researchers-- from social\nscientists to data scientists. Here, we use the publicly available data for a\nrange of anti-social and political events like ethnic conflicts, human right\nviolations and terrorist attacks across the globe. We aggregate these\nanti-social events over time and study the temporal evolution of these events.\nWe present here the results of several time-series analyses like recurrence\nintervals, Hurst R/S analysis, etc., that reveal the long memory of these\ntime-series. Further, we filter the data country-wise and study the time-series\nof these anti-social events within the individual countries. We find that the\ntime-series of these events have interesting statistical regularities and\ncorrelations using multi-dimensional scaling technique, the countries are then\ngrouped together in terms of the co-movements with respect to temporal growths\nof these anti-social events. The data science approaches to studying these\nanti-social phenomena may provide a deeper understanding about their formations\nand spreading. The results can help in framing public policies and creating\nstrategies that can check their spread and inhibit these anti-social\nphenomena.}\n","authors":"Syed Shariq Husain|Kiran Sharma","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.08614v2","link_pdf":"http://arxiv.org/pdf/1809.08614v2","link_doi":"","comment":"15 page, 5 figures. 
To appear in the Proceedings of Econophys-2017\n and APEC-2017, New Delhi","journal_ref":"","doi":"","primary_category":"physics.soc-ph","categories":"physics.soc-ph"} {"id":"1809.08705v1","submitted":"2018-09-24 00:12:28","updated":"2018-09-24 00:12:28","title":"On the Behavior of the Expectation-Maximization Algorithm for Mixture\n Models","abstract":" Finite mixture models are among the most popular statistical models used in\ndifferent data science disciplines. Despite their broad applicability,\ninference under these models typically leads to computationally challenging\nnon-convex problems. While the Expectation-Maximization (EM) algorithm is the\nmost popular approach for solving these non-convex problems, the behavior of\nthis algorithm is not well understood. In this work, we focus on the case of\nmixture of Laplacian (or Gaussian) distribution. We start by analyzing a simple\nequally weighted mixture of two single dimensional Laplacian distributions and\nshow that every local optimum of the population maximum likelihood estimation\nproblem is globally optimal. Then, we prove that the EM algorithm converges to\nthe ground truth parameters almost surely with random initialization. Our\nresult extends the existing results for Gaussian distribution to Laplacian\ndistribution. Then we numerically study the behavior of mixture models with\nmore than two components. Motivated by our extensive numerical experiments, we\npropose a novel stochastic method for estimating the mean of components of a\nmixture model. Our numerical experiments show that our algorithm outperforms\nthe Naive EM algorithm in almost all scenarios.\n","authors":"Babak Barazandeh|Meisam Razaviyayn","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.08705v1","link_pdf":"http://arxiv.org/pdf/1809.08705v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1809.08723v1","submitted":"2018-09-24 02:13:57","updated":"2018-09-24 02:13:57","title":"The Combinatorial Data Fusion Problem in Conflicted-supervised Learning","abstract":" The best merge problem in industrial data science generates instances where\ndisparate data sources place incompatible relational structures on the same set\n$V$ of objects. Graph vertex labelling data may include (1) missing or\nerroneous labels,(2) assertions that two vertices carry the same (unspecified)\nlabel, and (3) denying some subset of vertices from carrying the same label.\nConflicted-supervised learning applies to cases where no labelling scheme\nsatisfies (1), (2), and (3). Our rigorous formulation starts from a connected\nweighted graph $(V, E)$, and an independence system $\\mathcal{S}$ on $V$,\ncharacterized by its circuits, called forbidden sets. Global incompatibility is\nexpressed by the fact $V \\notin \\mathcal{S}$. Combinatorial data fusion seeks a\nsubset $E_1 \\subset E$ of maximum edge weight so that no vertex component of\nthe subgraph $(V, E_1)$ contains any forbidden set. Multicut and multiway cut\nare special cases where all forbidden sets have cardinality two. The general\ncase exhibits unintuitive properties, shown in counterexamples. The first in a\nseries of papers concentrates on cases where $(V, E)$ is a tree, and presents\nan algorithm on general graphs, in which the combinatorial data fusion problem\nis transferred to the Gomory-Hu tree, where it is solved using greedy set\ncover. Experimental results are given.\n","authors":"R. W. R. Darling|David G. Harris|Dev R. Phulara|John A. 
Proos","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.08723v1","link_pdf":"http://arxiv.org/pdf/1809.08723v1","link_doi":"","comment":"48 pages, 10 figures","journal_ref":"","doi":"","primary_category":"math.CO","categories":"math.CO|math.OC|05C85"} {"id":"1809.10024v1","submitted":"2018-09-24 19:23:28","updated":"2018-09-24 19:23:28","title":"Computational and informatics advances for reproducible data analysis in\n neuroimaging","abstract":" The reproducibility of scientific research has become a point of critical\nconcern. We argue that openness and transparency are critical for\nreproducibility, and we outline an ecosystem for open and transparent science\nthat has emerged within the human neuroimaging community. We discuss the range\nof open data sharing resources that have been developed for neuroimaging data,\nand the role of data standards (particularly the Brain Imaging Data Structure)\nin enabling the automated sharing, processing, and reuse of large neuroimaging\ndatasets. We outline how the open-source Python language has provided the basis\nfor a data science platform that enables reproducible data analysis and\nvisualization. We also discuss how new advances in software engineering, such\nas containerization, provide the basis for greater reproducibility in data\nanalysis. The emergence of this new ecosystem provides an example for many\nareas of science that are currently struggling with reproducibility.\n","authors":"Russell A. Poldrack|Krzysztof J. Gorgolewski|Gael Varoquaux","affiliations":"","link_abstract":"http://arxiv.org/abs/1809.10024v1","link_pdf":"http://arxiv.org/pdf/1809.10024v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|q-bio.NC|stat.ML"} {"id":"1810.02688v2","submitted":"2018-09-28 08:27:59","updated":"2018-10-19 13:07:57","title":"Wikistat 2.0: Educational Resources for Artificial Intelligence","abstract":" Big data, data science, deep learning, artificial intelligence are the key\nwords of intense hype related with a job market in full evolution, that impose\nto adapt the contents of our university professional trainings. Which\nartificial intelligence is mostly concerned by the job offers? Which\nmethodologies and technologies should be favored in the training programs?\nWhich objectives, tools and educational resources do we needed to put in place\nto meet these pressing needs? We answer these questions in describing the\ncontents and operational resources in the Data Science orientation of the\nspecialty Applied Mathematics at INSA Toulouse. We focus on basic mathematics\ntraining (Optimization, Probability, Statistics), associated with the practical\nimplementation of the most performing statistical learning algorithms, with the\nmost appropriate technologies and on real examples. Considering the huge\nvolatility of the technologies, it is imperative to train students in\nseft-training, this will be their technological watch tool when they will be in\nprofessional activity. This explains the structuring of the educational site\ngithub.com/wikistat into a set of tutorials. 
Finally, to motivate the thorough\npractice of these tutorials, a serious game is organized each year in the form\nof a prediction contest between students of Master degrees in Applied\nMathematics for IA.\n","authors":"Philippe Besse|Brendan Guillouet|Béatrice Laurent","affiliations":"IMT|IMT|IMT","link_abstract":"http://arxiv.org/abs/1810.02688v2","link_pdf":"http://arxiv.org/pdf/1810.02688v2","link_doi":"","comment":"in French","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.AI|math.ST|stat.ML|stat.TH"} {"id":"1810.01869v1","submitted":"2018-10-03 13:22:44","updated":"2018-10-03 13:22:44","title":"Machine Learning Suites for Online Toxicity Detection","abstract":" To identify and classify toxic online commentary, the modern tools of data\nscience transform raw text into key features from which either thresholding or\nlearning algorithms can make predictions for monitoring offensive\nconversations. We systematically evaluate 62 classifiers representing 19 major\nalgorithmic families against features extracted from the Jigsaw dataset of\nWikipedia comments. We compare the classifiers based on statistically\nsignificant differences in accuracy and relative execution time. Among these\nclassifiers for identifying toxic comments, tree-based algorithms provide the\nmost transparently explainable rules and rank-order the predictive contribution\nof each feature. Among 28 features of syntax, sentiment, emotion and outlier\nword dictionaries, a simple bad word list proves most predictive of offensive\ncommentary.\n","authors":"David Noever","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.01869v1","link_pdf":"http://arxiv.org/pdf/1810.01869v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.CL|cs.NE|stat.ML"} {"id":"1810.01943v1","submitted":"2018-10-03 20:18:35","updated":"2018-10-03 20:18:35","title":"AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and\n Mitigating Unwanted Algorithmic Bias","abstract":" Fairness is an increasingly important concern as machine learning models are\nused to support decision making in high-stakes applications such as mortgage\nlending, hiring, and prison sentencing. This paper introduces a new open source\nPython toolkit for algorithmic fairness, AI Fairness 360 (AIF360), released\nunder an Apache v2.0 license {https://github.com/ibm/aif360). The main\nobjectives of this toolkit are to help facilitate the transition of fairness\nresearch algorithms to use in an industrial setting and to provide a common\nframework for fairness researchers to share and evaluate algorithms.\n The package includes a comprehensive set of fairness metrics for datasets and\nmodels, explanations for these metrics, and algorithms to mitigate bias in\ndatasets and models. It also includes an interactive Web experience\n(https://aif360.mybluemix.net) that provides a gentle introduction to the\nconcepts and capabilities for line-of-business users, as well as extensive\ndocumentation, usage guidance, and industry-specific tutorials to enable data\nscientists and practitioners to incorporate the most appropriate tool for their\nproblem into their work products. The architecture of the package has been\nengineered to conform to a standard paradigm used in data science, thereby\nfurther improving usability for practitioners. 
Such architectural design and\nabstractions enable researchers and developers to extend the toolkit with their\nnew algorithms and improvements, and to use it for performance benchmarking. A\nbuilt-in testing infrastructure maintains code quality.\n","authors":"Rachel K. E. Bellamy|Kuntal Dey|Michael Hind|Samuel C. Hoffman|Stephanie Houde|Kalapriya Kannan|Pranay Lohia|Jacquelyn Martino|Sameep Mehta|Aleksandra Mojsilovic|Seema Nagar|Karthikeyan Natesan Ramamurthy|John Richards|Diptikalyan Saha|Prasanna Sattigeri|Moninder Singh|Kush R. Varshney|Yunfeng Zhang","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.01943v1","link_pdf":"http://arxiv.org/pdf/1810.01943v1","link_doi":"","comment":"20 pages","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI"} {"id":"1810.02893v1","submitted":"2018-10-05 21:36:25","updated":"2018-10-05 21:36:25","title":"Optimization on Spheres: Models and Proximal Algorithms with\n Computational Performance Comparisons","abstract":" We present a unified treatment of the abstract problem of finding the best\napproximation between a cone and spheres in the image of affine\ntransformations. Prominent instances of this problem are phase retrieval and\nsource localization. The common geometry binding these problems permits a\ngeneric application of algorithmic ideas and abstract convergence results for\nnonconvex optimization. We organize variational models for this problem into\nthree different classes and derive the main algorithmic approaches within these\nclasses (13 in all). We identify the central ideas underlying these methods and\nprovide thorough numerical benchmarks comparing their performance on synthetic\nand laboratory data. The software and data of our experiments are all publicly\naccessible. We also introduce one new algorithm, a cyclic relaxed\nDouglas-Rachford algorithm, which outperforms all other algorithms by every\nmeasure: speed, stability and accuracy. The analysis of this algorithm remains\nopen.\n","authors":"D. Russell Luke|Shoham Sabach|Marc Teboulle","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.02893v1","link_pdf":"http://arxiv.org/pdf/1810.02893v1","link_doi":"http://dx.doi.org/10.1137/18M1193025","comment":"23 pages, 11 benchmarking studies","journal_ref":"SIAM J. Mathematics of Data Science, 1(3), 408-445 (2019)","doi":"10.1137/18M1193025","primary_category":"math.OC","categories":"math.OC"} {"id":"1810.03198v1","submitted":"2018-10-07 19:25:48","updated":"2018-10-07 19:25:48","title":"Reinforcement Evolutionary Learning Method for self-learning","abstract":" In statistical modelling the biggest threat is concept drift which makes the\nmodel gradually showing deteriorating performance over time. There are state of\nthe art methodologies to detect the impact of concept drift, however general\nstrategy considered to overcome the issue in performance is to rebuild or\nre-calibrate the model periodically as the variable patterns for the model\nchanges significantly due to market change or consumer behavior change etc.\nQuantitative research is the most widely spread application of data science in\nMarketing or financial domain where applicability of state of the art\nreinforcement learning for auto-learning is less explored paradigm.\nReinforcement learning is heavily dependent on having a simulated environment\nwhich is majorly available for gaming or online systems, to learn from the live\nfeedback. 
However, there are some research happened on the area of online\nadvertisement, pricing etc where due to the nature of the online learning\nenvironment scope of reinforcement learning is explored. Our proposed solution\nis a reinforcement learning based, true self-learning algorithm which can adapt\nto the data change or concept drift and auto learn and self-calibrate for the\nnew patterns of the data solving the problem of concept drift.\n Keywords - Reinforcement learning, Genetic Algorithm, Q-learning,\nClassification modelling, CMA-ES, NES, Multi objective optimization, Concept\ndrift, Population stability index, Incremental learning, F1-measure, Predictive\nModelling, Self-learning, MCTS, AlphaGo, AlphaZero\n","authors":"Kumarjit Pathak|Jitin Kapila","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.03198v1","link_pdf":"http://arxiv.org/pdf/1810.03198v1","link_doi":"","comment":"5 figures","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.NE|stat.ML"} {"id":"1810.04599v2","submitted":"2018-10-10 15:40:27","updated":"2018-10-16 05:05:07","title":"Understanding Data Science Lifecycle Provenance via Graph Segmentation\n and Summarization","abstract":" Increasingly modern data science platforms today have non-intrusive and\nextensible provenance ingestion mechanisms to collect rich provenance and\ncontext information, handle modifications to the same file using\ndistinguishable versions, and use graph data models (e.g., property graphs) and\nquery languages (e.g., Cypher) to represent and manipulate the stored\nprovenance/context information. Due to the schema-later nature of the metadata,\nmultiple versions of the same files, and unfamiliar artifacts introduced by\nteam members, the \"provenance graph\" is verbose and evolving, and hard to\nunderstand; using standard graph query model, it is difficult to compose\nqueries and utilize this valuable information.\n In this paper, we propose two high-level graph query operators to address the\nverboseness and evolving nature of such provenance graphs. First, we introduce\na graph segmentation operator, which queries the retrospective provenance\nbetween a set of source vertices and a set of destination vertices via flexible\nboundary criteria to help users get insight about the derivation relationships\namong those vertices. We show the semantics of such a query in terms of a\ncontext-free grammar, and develop efficient algorithms that run orders of\nmagnitude faster than state-of-the-art. Second, we propose a graph\nsummarization operator that combines similar segments together to query\nprospective provenance of the underlying project. The operator allows tuning\nthe summary by ignoring vertex details and characterizing local structures, and\nensures the provenance meaning using path constraints. We show the optimal\nsummary problem is PSPACE-complete and develop effective approximation\nalgorithms. 
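The provenance abstract above introduces a segmentation operator that queries retrospective provenance between a set of source vertices and a set of destination vertices. As a much-simplified stand-in for that idea (ignoring the paper's flexible boundary criteria and Cypher-style query language), the sketch below uses plain reachability in a NetworkX digraph; the file and script names are made up:

```python
import networkx as nx

def segment(provenance, sources, destinations):
    """Very simplified stand-in for a provenance 'segmentation' query:
    return the induced subgraph of vertices lying on some directed path
    from a source vertex to a destination vertex."""
    reachable_from_sources = set(sources)
    for s in sources:
        reachable_from_sources |= nx.descendants(provenance, s)
    reaching_destinations = set(destinations)
    for d in destinations:
        reaching_destinations |= nx.ancestors(provenance, d)
    keep = reachable_from_sources & reaching_destinations
    return provenance.subgraph(keep).copy()

if __name__ == "__main__":
    # Hypothetical lineage: raw data -> cleaning script -> features -> model.
    g = nx.DiGraph()
    g.add_edges_from([
        ("raw.csv", "clean.py"), ("clean.py", "features.parquet"),
        ("features.parquet", "train.py"), ("train.py", "model.pkl"),
        ("notes.md", "train.py"),          # unrelated context artifact
    ])
    seg = segment(g, sources={"raw.csv"}, destinations={"model.pkl"})
    print(sorted(seg.nodes()))   # notes.md is excluded from the segment
```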
The operators are implemented on top of a property graph backend.\nWe evaluate our query methods extensively and show the effectiveness and\nefficiency of the proposed methods.\n","authors":"Hui Miao|Amol Deshpande","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.04599v2","link_pdf":"http://arxiv.org/pdf/1810.04599v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1810.04668v1","submitted":"2018-10-10 18:25:02","updated":"2018-10-10 18:25:02","title":"Intrusion Detection Using Mouse Dynamics","abstract":" Compared to other behavioural biometrics, mouse dynamics is a less explored\narea. General purpose data sets containing unrestricted mouse usage data are\nusually not available. The Balabit data set was released in 2016 for a data\nscience competition, which against the few subjects, can be considered the\nfirst adequate publicly available one. This paper presents a performance\nevaluation study on this data set for impostor detection. The existence of very\nshort test sessions makes this data set challenging. Raw data were segmented\ninto mouse move, point and click and drag and drop types of mouse actions, then\nseveral features were extracted. In contrast to keystroke dynamics, mouse data\nis not sensitive, therefore it is possible to collect negative mouse dynamics\ndata and to use two-class classifiers for impostor detection. Both action- and\nset of actions-based evaluations were performed. Set of actions-based\nevaluation achieves 0.92 AUC on the test part of the data set. However, the\nsame type of evaluation conducted on the training part of the data set resulted\nin maximal AUC (1) using only 13 actions. Drag and drop mouse actions proved to\nbe the best actions for impostor detection.\n","authors":"Margit Antal|Elod Egyed-Zsigmond","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.04668v1","link_pdf":"http://arxiv.org/pdf/1810.04668v1","link_doi":"","comment":"Submitted to IET Biometrics on 23 May 2018","journal_ref":"","doi":"","primary_category":"cs.HC","categories":"cs.HC|cs.CR|cs.LG"} {"id":"1811.02491v1","submitted":"2018-10-15 20:19:46","updated":"2018-10-15 20:19:46","title":"Mobile Data Science: Towards Understanding Data-Driven Intelligent\n Mobile Applications","abstract":" Due to the popularity of smart mobile phones and context-aware technology,\nvarious contextual data relevant to users' diverse activities with mobile\nphones is available around us. This enables the study on mobile phone data and\ncontext-awareness in computing, for the purpose of building data-driven\nintelligent mobile applications, not only on a single device but also in a\ndistributed environment for the benefit of end users. Based on the availability\nof mobile phone data, and the usefulness of data-driven applications, in this\npaper, we discuss about mobile data science that involves in collecting the\nmobile phone data from various sources and building data-driven models using\nmachine learning techniques, in order to make dynamic decisions intelligently\nin various day-to-day situations of the users. For this, we first discuss the\nfundamental concepts and the potentiality of mobile data science to build\nintelligent applications. We also highlight the key elements and explain\nvarious key modules involving in the process of mobile data science. 
This\narticle is the first in the field to draw a big picture of, and to think about,\nmobile data science and its potential in developing various data-driven\nintelligent mobile applications. We believe this study will help both\nresearchers and application developers to build smart data-driven mobile\napplications that assist end mobile phone users in their daily activities.\n","authors":"Iqbal H. Sarker","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.02491v1","link_pdf":"http://arxiv.org/pdf/1811.02491v1","link_doi":"","comment":"Journal, 11 pages, Double Column","journal_ref":"EAI Endorsed Transactions on Scalable Information Systems, 2018","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.LG"} {"id":"1810.07324v1","submitted":"2018-10-17 00:11:20","updated":"2018-10-17 00:11:20","title":"A Short Introduction to Local Graph Clustering Methods and Software","abstract":" Graph clustering has many important applications in computing, but due to the\nincreasing sizes of graphs, even traditionally fast clustering methods can be\ncomputationally expensive for real-world graphs of interest. Scalability\nproblems led to the development of local graph clustering algorithms that come\nwith a variety of theoretical guarantees. Rather than return a global\nclustering of the entire graph, local clustering algorithms return a single\ncluster around a given seed node or set of seed nodes. These algorithms improve\nscalability because they use time and memory resources that depend only on the\nsize of the cluster returned, instead of the size of the input graph. Indeed,\nfor many of them, their running time grows linearly with the size of the\noutput. In addition to scalability arguments, local graph clustering algorithms\nhave proven to be very useful for identifying and interpreting small-scale and\nmeso-scale structure in large-scale graphs. As opposed to heuristic operational\nprocedures, this class of algorithms comes with strong algorithmic and\nstatistical theory. These include statistical guarantees that prove they have\nimplicit regularization properties. One of the challenges with the existing\nliterature on these approaches is that they are published in a wide variety of\nareas, including theoretical computer science, statistics, data science, and\nmathematics. This has made it difficult to relate the various algorithms and\nideas together into a cohesive whole. We have recently been working on unifying\nthese diverse perspectives through the lens of optimization as well as\nproviding software to perform these computations in a cohesive fashion. In this\nnote, we provide a brief introduction to local graph clustering, we provide\nsome representative examples of our perspective, and we introduce our software\nnamed Local Graph Clustering (LGC).\n","authors":"Kimon Fountoulakis|David F. Gleich|Michael W. Mahoney","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.07324v1","link_pdf":"http://arxiv.org/pdf/1810.07324v1","link_doi":"","comment":"3 pages, 2 figures","journal_ref":"Complex Networks 2018","doi":"","primary_category":"cs.SI","categories":"cs.SI"} {"id":"1810.08278v1","submitted":"2018-10-18 21:13:45","updated":"2018-10-18 21:13:45","title":"Interpolating between Optimal Transport and MMD using Sinkhorn\n Divergences","abstract":" Comparing probability distributions is a fundamental problem in data\nsciences. 
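The local graph clustering note above returns a single cluster around a seed node. Below is a minimal sketch of one standard recipe in that family, seeded (personalized) PageRank followed by a conductance sweep, written with NetworkX; it scans the whole graph for simplicity, whereas genuinely local methods only touch nodes near the seed, and it is not the LGC software itself:

```python
import networkx as nx

def ppr_sweep_cluster(G, seed, alpha=0.85):
    """Rank nodes by seeded PageRank normalized by degree, then sweep to the
    prefix of lowest conductance. A toy, whole-graph version of the idea."""
    ppr = nx.pagerank(G, alpha=alpha, personalization={seed: 1.0})
    order = sorted(
        (n for n in G if ppr[n] > 0),
        key=lambda n: ppr[n] / max(G.degree(n), 1),
        reverse=True,
    )
    best_set, best_phi = {seed}, 1.0
    current = set()
    for n in order:
        current.add(n)
        if len(current) == len(G):
            break                      # complement would be empty
        phi = nx.conductance(G, current)
        if phi < best_phi:
            best_set, best_phi = set(current), phi
    return best_set, best_phi

if __name__ == "__main__":
    G = nx.barbell_graph(10, 2)        # two dense cliques joined by a path
    cluster, phi = ppr_sweep_cluster(G, seed=0)
    print(sorted(cluster), round(phi, 3))
```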
Simple norms and divergences such as the total variation and the\nrelative entropy only compare densities in a point-wise manner and fail to\ncapture the geometric nature of the problem. In sharp contrast, Maximum Mean\nDiscrepancies (MMD) and Optimal Transport distances (OT) are two classes of\ndistances between measures that take into account the geometry of the\nunderlying space and metrize the convergence in law.\n This paper studies the Sinkhorn divergences, a family of geometric\ndivergences that interpolates between MMD and OT. Relying on a new notion of\ngeometric entropy, we provide theoretical guarantees for these divergences:\npositivity, convexity and metrization of the convergence in law. On the\npractical side, we detail a numerical scheme that enables the large scale\napplication of these divergences for machine learning: on the GPU, gradients of\nthe Sinkhorn loss can be computed for batches of a million samples.\n","authors":"Jean Feydy|Thibault Séjourné|François-Xavier Vialard|Shun-ichi Amari|Alain Trouvé|Gabriel Peyré","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.08278v1","link_pdf":"http://arxiv.org/pdf/1810.08278v1","link_doi":"","comment":"15 pages, 5 figures","journal_ref":"","doi":"","primary_category":"math.ST","categories":"math.ST|stat.TH|62"} {"id":"1810.08553v3","submitted":"2018-10-19 15:36:35","updated":"2019-03-14 16:13:30","title":"Federated Learning in Distributed Medical Databases: Meta-Analysis of\n Large-Scale Subcortical Brain Data","abstract":" At this moment, databanks worldwide contain brain images of previously\nunimaginable numbers. Combined with developments in data science, these massive\ndata provide the potential to better understand the genetic underpinnings of\nbrain diseases. However, different datasets, which are stored at different\ninstitutions, cannot always be shared directly due to privacy and legal\nconcerns, thus limiting the full exploitation of big data in the study of brain\ndisorders. Here we propose a federated learning framework for securely\naccessing and meta-analyzing any biomedical data without sharing individual\ninformation. We illustrate our framework by investigating brain structural\nrelationships across diseases and clinical cohorts. The framework is first\ntested on synthetic data and then applied to multi-centric, multi-database\nstudies including ADNI, PPMI, MIRIAD and UK Biobank, showing the potential of\nthe approach for further applications in distributed analysis of multi-centric\ncohorts\n","authors":"Santiago Silva|Boris Gutman|Eduardo Romero|Paul M Thompson|Andre Altmann|Marco Lorenzi","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.08553v3","link_pdf":"http://arxiv.org/pdf/1810.08553v3","link_doi":"","comment":"Federated learning, distributed databases, PCA, SVD, meta-analysis,\n brain disease","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG|q-bio.NC|q-bio.QM"} {"id":"1810.08728v2","submitted":"2018-10-20 01:41:05","updated":"2018-10-29 13:51:21","title":"Roadmap for Reliable Ensemble Forecasting of the Sun-Earth System","abstract":" The authors of this report met on 28-30 March 2018 at the New Jersey\nInstitute of Technology, Newark, New Jersey, for a 3-day workshop that brought\ntogether a group of data providers, expert modelers, and computer and data\nscientists, in the solar discipline. 
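The Sinkhorn-divergence abstract above interpolates between OT and MMD through entropic regularization. The NumPy sketch below shows the basic machinery, plain Sinkhorn iterations plus a debiased combination of transport costs; it is a toy CPU version (the paper's contribution includes GPU-scale computation and theoretical guarantees), and the divergence defined in the paper differs in details such as the entropic correction term:

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps=1.0, n_iter=500):
    """Entropic OT cost between weight vectors a, b with cost matrix C,
    computed by plain Sinkhorn matrix scaling."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]      # approximate optimal transport plan
    return float(np.sum(P * C))

def sinkhorn_divergence(a, b, x, y, eps=1.0):
    """Debiased combination OT(a,b) - (OT(a,a) + OT(b,b)) / 2, which behaves
    like OT for small eps and like an MMD-type quantity for large eps."""
    Cxy = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    Cxx = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    Cyy = np.sum((y[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return (sinkhorn_cost(a, b, Cxy, eps)
            - 0.5 * sinkhorn_cost(a, a, Cxx, eps)
            - 0.5 * sinkhorn_cost(b, b, Cyy, eps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, y = rng.normal(0, 1, (50, 2)), rng.normal(0.5, 1, (60, 2))
    a, b = np.full(50, 1 / 50), np.full(60, 1 / 60)
    print(round(sinkhorn_divergence(a, b, x, y), 4))
```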
Their objective was to identify challenges\nin the path towards building an effective framework to achieve transformative\nadvances in the understanding and forecasting of the Sun-Earth system from the\nupper convection zone of the Sun to the Earth's magnetosphere. The workshop\naimed to develop a research roadmap that targets the scientific challenge of\ncoupling observations and modeling with emerging data-science research to\nextract knowledge from the large volumes of data (observed and simulated) while\nstimulating computer science with new research applications. The desire among\nthe attendees was to promote future trans-disciplinary collaborations and\nidentify areas of convergence across disciplines. The workshop combined a set\nof plenary sessions featuring invited introductory talks and workshop progress\nreports, interleaved with a set of breakout sessions focused on specific topics\nof interest. Each breakout group generated short documents, listing the\nchallenges identified during their discussions in addition to possible ways of\nattacking them collectively. These documents were combined into this\nreport-wherein a list of prioritized activities have been collated, shared and\nendorsed.\n","authors":"Gelu Nita|Rafal Angryk|Berkay Aydin|Juan Banda|Tim Bastian|Tom Berger|Veronica Bindi|Laura Boucheron|Wenda Cao|Eric Christian|Georgia de Nolfo|Edward DeLuca|Marc DeRosa|Cooper Downs|Gregory Fleishman|Olac Fuentes|Dale Gary|Frank Hill|Todd Hoeksema|Qiang Hu|Raluca Ilie|Jack Ireland|Farzad Kamalabadi|Kelly Korreck|Alexander Kosovichev|Jessica Lin|Noe Lugaz|Anthony Mannucci|Nagi Mansour|Petrus Martens|Leila Mays|James McAteer|Scott W. McIntosh|Vincent Oria|David Pan|Marco Panesi|W. Dean Pesnell|Alexei Pevtsov|Valentin Pillet|Laurel Rachmeler|Aaron Ridley|Ludger Scherliess|Gabor Toth|Marco Velli|Stephen White|Jie Zhang|Shasha Zou","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.08728v2","link_pdf":"http://arxiv.org/pdf/1810.08728v2","link_doi":"","comment":"Workshop Report","journal_ref":"","doi":"","primary_category":"astro-ph.SR","categories":"astro-ph.SR"} {"id":"1810.09646v2","submitted":"2018-10-23 03:44:02","updated":"2020-02-21 02:17:55","title":"Gromov-Monge quasi-metrics and distance distributions","abstract":" Applications in data science, shape analysis and object classification\nfrequently require maps between metric spaces which preserve geometry as\nfaithfully as possible. In this paper, we combine the Monge formulation of\noptimal transport with the Gromov-Hausdorff distance construction to define a\nmeasure of the minimum amount of geometric distortion required to map one\nmetric measure space onto another. We show that the resulting quantity, called\nGromov-Monge distance, defines an extended quasi-metric on the space of\nisomorphism classes of metric measure spaces and that it can be promoted to a\ntrue metric on certain subclasses of mm-spaces. We also give precise\ncomparisons between Gromov-Monge distance and several other metrics which have\nappeared previously, such as the Gromov-Wasserstein metric and the continuous\nProcrustes metric of Lipman, Al-Aifari and Daubechies. Finally, we derive\npolynomial-time computable lower bounds for Gromov-Monge distance. These lower\nbounds are expressed in terms of distance distributions, which are classical\ninvariants of metric measure spaces summarizing the volume growth of metric\nballs. 
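The Gromov-Monge abstract above derives polynomial-time lower bounds expressed through distance distributions. As a toy illustration of that invariant (not the paper's actual bounds), the sketch below collects all pairwise distances of two point clouds and compares the two distributions with a one-dimensional Wasserstein distance; SciPy is assumed to be available:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def distance_distribution(points):
    """All pairwise Euclidean distances of a finite metric space with uniform
    weights -- a classical invariant summarizing volume growth of metric balls."""
    x = np.asarray(points, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    i, j = np.triu_indices(len(x), k=1)
    return d[i, j]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = rng.uniform(0, 2 * np.pi, 200)
    circle = np.c_[np.cos(t), np.sin(t)]
    blob = rng.normal(0, 0.5, (200, 2))
    # Cheap-to-compare summaries giving a weak signal that the spaces differ.
    print(wasserstein_distance(distance_distribution(circle),
                               distance_distribution(blob)))
```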
In the second half of the paper, which may be of independent interest,\nwe study the discriminative power of these lower bounds for simple subclasses\nof metric measure spaces. We first consider the case of planar curves, where we\ngive a counterexample to the Curve Histogram Conjecture of Brinkman and Olver.\nOur results on plane curves are then generalized to higher dimensional\nmanifolds, where we prove some sphere characterization theorems for the\ndistance distribution invariant. Finally, we consider several inverse problems\non recovering a metric graph from a collection of localized versions of\ndistance distributions. Results are derived by establishing connections with\nconcepts from the fields of computational geometry and topological data\nanalysis.\n","authors":"Facundo Mémoli|Tom Needham","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.09646v2","link_pdf":"http://arxiv.org/pdf/1810.09646v2","link_doi":"","comment":"Version 2: Added many new results and improved exposition","journal_ref":"","doi":"","primary_category":"math.MG","categories":"math.MG"} {"id":"1810.10308v1","submitted":"2018-10-23 11:54:23","updated":"2018-10-23 11:54:23","title":"TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets","abstract":" Publicly available social media archives facilitate research in a variety of\nfields, such as data science, sociology or the digital humanities, where\nTwitter has emerged as one of the most prominent sources. However, obtaining,\narchiving and annotating large amounts of tweets is costly. In this paper, we\ndescribe TweetsKB, a publicly available corpus of currently more than 1.5\nbillion tweets, spanning almost 5 years (Jan'13-Nov'17). Metadata information\nabout the tweets as well as extracted entities, hashtags, user mentions and\nsentiment information are exposed using established RDF/S vocabularies. Next to\na description of the extraction and annotation process, we present use cases to\nillustrate scenarios for entity-centric information exploration, data\nintegration and knowledge discovery facilitated by TweetsKB.\n","authors":"Pavlos Fafalios|Vasileios Iosifidis|Eirini Ntoutsi|Stefan Dietze","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.10308v1","link_pdf":"http://arxiv.org/pdf/1810.10308v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.IR","categories":"cs.IR"} {"id":"1810.10989v3","submitted":"2018-10-25 17:23:53","updated":"2018-12-17 13:33:40","title":"Reducing over-smoothness in speech synthesis using Generative\n Adversarial Networks","abstract":" Speech synthesis is widely used in many practical applications. In recent\nyears, speech synthesis technology has developed rapidly. However, one of the\nreasons why synthetic speech is unnatural is that it often has over-smoothness.\nIn order to improve the naturalness of synthetic speech, we first extract the\nmel-spectrogram of speech and convert it into a real image, then take the\nover-smooth mel-spectrogram image as input, and use image-to-image translation\nGenerative Adversarial Networks(GANs) framework to generate a more realistic\nmel-spectrogram. Finally, the results show that this method greatly reduces the\nover-smoothness of synthesized speech and is more close to the mel-spectrogram\nof real speech.\n","authors":"Leyuan Sheng|Evgeniy N. 
Pavlovskiy","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.10989v3","link_pdf":"http://arxiv.org/pdf/1810.10989v3","link_doi":"","comment":"Accepted by Siberian Symposium on Data Science and Engineering\n (SSDSE) 2018","journal_ref":"","doi":"","primary_category":"cs.SD","categories":"cs.SD|eess.AS"} {"id":"1810.11185v2","submitted":"2018-10-26 04:15:51","updated":"2019-02-01 02:46:46","title":"Beyond A/B Testing: Sequential Randomization for Developing\n Interventions in Scaled Digital Learning Environments","abstract":" Randomized experiments ensure robust causal inference that are critical to\neffective learning analytics research and practice. However, traditional\nrandomized experiments, like A/B tests, are limiting in large scale digital\nlearning environments. While traditional experiments can accurately compare two\ntreatment options, they are less able to inform how to adapt interventions to\ncontinually meet learners' diverse needs. In this work, we introduce a trial\ndesign for developing adaptive interventions in scaled digital learning\nenvironments -- the sequential randomized trial (SRT). With the goal of\nimproving learner experience and developing interventions that benefit all\nlearners at all times, SRTs inform how to sequence, time, and personalize\ninterventions. In this paper, we provide an overview of SRTs, and we illustrate\nthe advantages they hold compared to traditional experiments. We describe a\nnovel SRT run in a large scale data science MOOC. The trial results\ncontextualize how learner engagement can be addressed through inclusive\nculturally targeted reminder emails. We also provide practical advice for\nresearchers who aim to run their own SRTs to develop adaptive interventions in\nscaled digital learning environments.\n","authors":"Timothy NeCamp|Josh Gardner|Christopher Brooks","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.11185v2","link_pdf":"http://arxiv.org/pdf/1810.11185v2","link_doi":"http://dx.doi.org/10.1145/3303772.3303812","comment":"","journal_ref":"2019, The 9th International Learning Analytics & Knowledge\n Conference, Tempe, AZ, USA. ACM, New York, NY, USA","doi":"10.1145/3303772.3303812","primary_category":"stat.AP","categories":"stat.AP|cs.CY"} {"id":"1810.11957v1","submitted":"2018-10-29 05:08:16","updated":"2018-10-29 05:08:16","title":"Evolutionary Self-Expressive Models for Subspace Clustering","abstract":" The problem of organizing data that evolves over time into clusters is\nencountered in a number of practical settings. We introduce evolutionary\nsubspace clustering, a method whose objective is to cluster a collection of\nevolving data points that lie on a union of low-dimensional evolving subspaces.\nTo learn the parsimonious representation of the data points at each time step,\nwe propose a non-convex optimization framework that exploits the\nself-expressiveness property of the evolving data while taking into account\nrepresentation from the preceding time step. To find an approximate solution to\nthe aforementioned non-convex optimization problem, we develop a scheme based\non alternating minimization that both learns the parsimonious representation as\nwell as adaptively tunes and infers a smoothing parameter reflective of the\nrate of data evolution. The latter addresses a fundamental challenge in\nevolutionary clustering -- determining if and to what extent one should\nconsider previous clustering solutions when analyzing an evolving data\ncollection. 
Our experiments on both synthetic and real-world datasets\ndemonstrate that the proposed framework outperforms state-of-the-art static\nsubspace clustering algorithms and existing evolutionary clustering schemes in\nterms of both accuracy and running time, in a range of scenarios.\n","authors":"Abolfazl Hashemi|Haris Vikalo","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.11957v1","link_pdf":"http://arxiv.org/pdf/1810.11957v1","link_doi":"http://dx.doi.org/10.1109/JSTSP.2018.2877478","comment":"","journal_ref":"IEEE Journal of Selected Topics in Signal Processing, Special\n Issue on Data Science: Robust Subspace Learning and Tracking, vol. 12, no. 6,\n December 2018","doi":"10.1109/JSTSP.2018.2877478","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1810.12379v1","submitted":"2018-10-29 19:42:41","updated":"2018-10-29 19:42:41","title":"Renarration for All","abstract":" The accessibility of content for all has been a key goal of the Web since its\nconception. However, true accessibility -- access to relevant content in the\nglobal context -- has been elusive for reasons that extend beyond physical\naccessibility issues. Among them are the spoken languages, literacy levels,\nexpertise, and culture. These issues are highly significant, since information\nmay not reach those who are the most in need of it. For example, consider minimum\nwage laws that are published in legalese on government sites and the\nlow-literate and immigrant populations who need them. While some organizations and volunteers\nwork on bridging such gaps by creating and disseminating alternative versions\nof such content, Web-scale solutions must be developed to take advantage of the Web's\ndistributed dissemination capabilities. This work examines content\naccessibility from the perspective of inclusiveness. For this purpose, a human\nin the loop approach for renarrating Web content is proposed, where a\nrenarrator creates an alternative narrative of some Web content with the intent\nof extending its reach. A renarration relates some Web content with an\nalternative version by means of transformations like simplification,\nelaboration, translation, or production of audio and video material. This work\npresents a model and a basic architecture for supporting renarrations along\nwith various scenarios. We also discuss the potential of the W3C specification\nfor the Web Annotation Data Model towards a more inclusive and decentralized social\nweb.\n","authors":"T. B. Dinesh|S. Uskudarli","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.12379v1","link_pdf":"http://arxiv.org/pdf/1810.12379v1","link_doi":"","comment":"","journal_ref":"IIITB Data Science Communications, vol 1, (2016)","doi":"","primary_category":"cs.HC","categories":"cs.HC|cs.CY"} {"id":"1810.12847v2","submitted":"2018-10-30 16:41:22","updated":"2018-11-01 12:34:32","title":"AI for the Common Good?! Pitfalls, challenges, and Ethics Pen-Testing","abstract":" Recently, many AI researchers and practitioners have embarked on research\nvisions that involve doing AI for \"Good\". This is part of a general drive\ntowards infusing AI research and practice with ethical thinking. One frequent\ntheme in current ethical guidelines is the requirement that AI be good for all,\nor: contribute to the Common Good. But what is the Common Good, and is it\nenough to want to be good? Via four lead questions, I will illustrate\nchallenges and pitfalls when determining, from an AI point of view, what the\nCommon Good is and how it can be enhanced by AI. 
The questions are: What is the\nproblem / What is a problem?, Who defines the problem?, What is the role of\nknowledge?, and What are important side effects and dynamics? The illustration\nwill use an example from the domain of \"AI for Social Good\", more specifically\n\"Data Science for Social Good\". Even if the importance of these questions may\nbe known at an abstract level, they do not get asked sufficiently in practice,\nas shown by an exploratory study of 99 contributions to recent conferences in\nthe field. Turning these challenges and pitfalls into a positive\nrecommendation, as a conclusion I will draw on another characteristic of\ncomputer-science thinking and practice to make these impediments visible and\nattenuate them: \"attacks\" as a method for improving design. This results in the\nproposal of ethics pen-testing as a method for helping AI designs to better\ncontribute to the Common Good.\n","authors":"Bettina Berendt","affiliations":"","link_abstract":"http://arxiv.org/abs/1810.12847v2","link_pdf":"http://arxiv.org/pdf/1810.12847v2","link_doi":"","comment":"to appear in Paladyn. Journal of Behavioral Robotics; accepted on\n 27-10-2018","journal_ref":"","doi":"","primary_category":"cs.AI","categories":"cs.AI|cs.CY"} {"id":"1811.00591v1","submitted":"2018-11-01 19:00:29","updated":"2018-11-01 19:00:29","title":"Defining a Metric Space of Host Logs and Operational Use Cases","abstract":" Host logs, in particular, Windows Event Logs, are a valuable source of\ninformation often collected by security operation centers (SOCs). The\nsemi-structured nature of host logs inhibits automated analytics, and while\nmanual analysis is common, the sheer volume makes manual inspection of all logs\nimpossible. Although many powerful algorithms for analyzing time-series and\nsequential data exist, utilization of such algorithms for most cyber security\napplications is either infeasible or requires tailored, research-intensive\npreparations. In particular, basic mathematic and algorithmic developments for\nproviding a generalized, meaningful similarity metric on system logs is needed\nto bridge the gap between many existing sequential data mining methods and this\ncurrently available but under-utilized data source. In this paper, we provide a\nrigorous definition of a metric product space on Windows Event Logs, providing\nan embedding that allows for the application of established machine learning\nand time-series analysis methods. We then demonstrate the utility and\nflexibility of this embedding with multiple use-cases on real data: (1)\ncomparing known infected to new host log streams for attack detection and\nforensics, (2) collapsing similar streams of logs into semantically-meaningful\ngroups (by user, by role), thereby reducing the quantity of data but not the\ncontent, (3) clustering logs as well as short sequences of logs to identify and\nvisualize user behaviors and background processes over time. Overall, we\nprovide a metric space framework for general host logs and log sequences that\nrespects semantic similarity and facilitates a wide variety of data science\nanalytics to these logs without data-specific preparations for each.\n","authors":"Miki E. Verma|Robert A. 
Bridges","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.00591v1","link_pdf":"http://arxiv.org/pdf/1811.00591v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.AP","categories":"stat.AP|cs.CR"} {"id":"1811.00731v2","submitted":"2018-11-02 04:14:09","updated":"2019-06-17 04:30:19","title":"The age of secrecy and unfairness in recidivism prediction","abstract":" In our current society, secret algorithms make important decisions about\nindividuals. There has been substantial discussion about whether these\nalgorithms are unfair to groups of individuals. While noble, this pursuit is\ncomplex and ultimately stagnating because there is no clear definition of\nfairness and competing definitions are largely incompatible. We argue that the\nfocus on the question of fairness is misplaced, as these algorithms fail to\nmeet a more important and yet readily obtainable goal: transparency. As a\nresult, creators of secret algorithms can provide incomplete or misleading\ndescriptions about how their models work, and various other kinds of errors can\neasily go unnoticed. By partially reverse engineering the COMPAS algorithm -- a\nrecidivism-risk scoring algorithm used throughout the criminal justice system\n-- we show that it does not seem to depend linearly on the defendant's age,\ndespite statements to the contrary by the algorithm's creator. Furthermore, by\nsubtracting from COMPAS its (hypothesized) nonlinear age component, we show\nthat COMPAS does not necessarily depend on race, contradicting ProPublica's\nanalysis, which assumed linearity in age. In other words, faulty assumptions\nabout a proprietary algorithm lead to faulty conclusions that go unchecked\nwithout careful reverse engineering. Were the algorithm transparent in the\nfirst place, this would likely not have occurred. The most important result in\nthis work is that we find that there are many defendants with low risk score\nbut long criminal histories, suggesting that data inconsistencies occur\nfrequently in criminal justice databases. We argue that transparency satisfies\na different notion of procedural fairness by providing both the defendants and\nthe public with the opportunity to scrutinize the methodology and calculations\nbehind risk scores for recidivism.\n","authors":"Cynthia Rudin|Caroline Wang|Beau Coker","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.00731v2","link_pdf":"http://arxiv.org/pdf/1811.00731v2","link_doi":"","comment":"","journal_ref":"Harvard Data Science Review 2(1), 2020","doi":"","primary_category":"stat.AP","categories":"stat.AP|cs.CY"} {"id":"1811.01114v2","submitted":"2018-11-02 22:36:45","updated":"2020-06-03 00:54:52","title":"Geometric characterization of data sets with unique reduced Gröbner\n bases","abstract":" Model selection based on experimental data is an important challenge in\nbiological data science. Particularly when collecting data is expensive or time\nconsuming, as it is often the case with clinical trial and biomolecular\nexperiments, the problem of selecting information-rich data becomes crucial for\ncreating relevant models. We identify geometric properties of input data that\nresult in a unique algebraic model and we show that if the data form a\nstaircase, or a so-called linear shift of a staircase, the ideal of the points\nhas a unique reduced Gro \\\"bner basis and thus corresponds to a unique model.\nWe use linear shifts to partition data into equivalence classes with the same\nbasis. 
We demonstrate the utility of the results by applying them to a Boolean\nmodel of the well-studied lac operon in E. coli.\n","authors":"Elena S. Dimitrova|Qijun He|Brandilyn Stigler|Anyu Zhang","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.01114v2","link_pdf":"http://arxiv.org/pdf/1811.01114v2","link_doi":"http://dx.doi.org/10.1007/s11538-019-00624-x","comment":"","journal_ref":"Bulletin of Mathematical Biology 81 (2019) 2691-2705","doi":"10.1007/s11538-019-00624-x","primary_category":"math.AG","categories":"math.AG"} {"id":"1811.02021v1","submitted":"2018-11-05 20:30:49","updated":"2018-11-05 20:30:49","title":"Using GitHub Classroom To Teach Statistics","abstract":" Git and GitHub are common tools for keeping track of multiple versions of\ndata analytic content, which allow for more than one person to simultaneously\nwork on a project. GitHub Classroom aims to provide a way for students to work\non and submit their assignments via Git and GitHub, giving teachers an\nopportunity to teach these version control tools as part of their course. In\nthe Fall 2017 semester, we implemented GitHub Classroom in two educational\nsettings--an introductory computational statistics lab and a more advanced\ncomputational statistics course. We found many educational benefits of\nimplementing GitHub Classroom, such as easily providing coding feedback during\nassignments and making students more confident in their ability to collaborate\nand use version control tools for future data science work. To encourage and\nease the transition into using GitHub Classroom, we provide free and publicly\navailable resources--both for students to begin using Git/GitHub and for\nteachers to use GitHub Classroom for their own courses.\n","authors":"Jacob Fiksel|Johanna S. Hardin|Leah R. Jager|Margaret A. Taub","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.02021v1","link_pdf":"http://arxiv.org/pdf/1811.02021v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT|stat.AP"} {"id":"1811.03435v3","submitted":"2018-11-06 03:11:09","updated":"2020-07-21 22:48:01","title":"Data Science as Political Action: Grounding Data Science in a Politics\n of Justice","abstract":" In response to recent controversies, the field of data science has rushed to\nadopt codes of ethics. Such professional codes, however, are ill-equipped to\naddress broad matters of social justice. Instead of ethics codes, I argue, the\nfield must embrace politics. Data scientists must recognize themselves as\npolitical actors engaged in normative constructions of society and, as befits\npolitical work, evaluate their work according to its downstream material\nimpacts on people's lives. I justify this notion in two parts: first, by\narticulating why data scientists must recognize themselves as political actors,\nand second, by describing how the field can evolve toward a deliberative and\nrigorous grounding in a politics of social justice. Part 1 responds to three\narguments that are commonly invoked by data scientists when they are challenged\nto take political positions regarding their work. In confronting these\narguments, I will demonstrate why attempting to remain apolitical is itself a\npolitical stance--a fundamentally conservative one--and why the field's current\nattempts to promote \"social good\" dangerously rely on vague and unarticulated\npolitical assumptions. 
Part 2 proposes a framework for what a\npolitically-engaged data science could look like and how to achieve it,\nrecognizing the challenge of reforming the field in this manner. I\nconceptualize the process of incorporating politics into data science in four\nstages: becoming interested in directly addressing social issues, recognizing\nthe politics underlying these issues, redirecting existing methods toward new\napplications, and, finally, developing new practices and methods that orient\ndata science around a mission of social justice. The path ahead does not\nrequire data scientists to abandon their technical expertise, but it does\nentail expanding their notions of what problems to work on and how to engage\nwith society.\n","authors":"Ben Green","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.03435v3","link_pdf":"http://arxiv.org/pdf/1811.03435v3","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY|cs.LG"} {"id":"1811.02287v1","submitted":"2018-11-06 11:10:38","updated":"2018-11-06 11:10:38","title":"Defining Big Data Analytics Benchmarks for Next Generation\n Supercomputers","abstract":" The design and construction of high performance computing (HPC) systems\nrelies on exhaustive performance analysis and benchmarking. Traditionally this\nactivity has been geared exclusively towards simulation scientists, who,\nunsurprisingly, have been the primary customers of HPC for decades. However,\nthere is a large and growing volume of data science work that requires these\nlarge scale resources, and as such the calls for inclusion and investments in\ndata for HPC have been increasing. So when designing a next generation HPC\nplatform, it is necessary to have HPC-amenable big data analytics benchmarks.\nIn this paper, we propose a set of big data analytics benchmarks and sample\ncodes designed for testing the capabilities of current and next generation\nsupercomputers.\n","authors":"Drew Schmidt|Junqi Yin|Michael Matheson|Bronson Messer|Mallikarjun Shankar","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.02287v1","link_pdf":"http://arxiv.org/pdf/1811.02287v1","link_doi":"","comment":"5 figures","journal_ref":"","doi":"","primary_category":"cs.PF","categories":"cs.PF"} {"id":"1811.02288v1","submitted":"2018-11-06 11:13:09","updated":"2018-11-06 11:13:09","title":"High Dimensional Clustering with $r$-nets","abstract":" Clustering, a fundamental task in data science and machine learning, groups a\nset of objects in such a way that objects in the same cluster are closer to\neach other than to those in other clusters. In this paper, we consider a\nwell-known structure, so-called $r$-nets, which rigorously captures the\nproperties of clustering. We devise algorithms that improve the run-time of\napproximating $r$-nets in high-dimensional spaces with $\\ell_1$ and $\\ell_2$\nmetrics from $\\tilde{O}(dn^{2-\\Theta(\\sqrt{\\epsilon})})$ to $\\tilde{O}(dn +\nn^{2-\\alpha})$, where $\\alpha = \\Omega({\\epsilon^{1/3}}/{\\log(1/\\epsilon)})$.\nThese algorithms are also used to improve a framework that provides approximate\nsolutions to other high dimensional distance problems. Using this framework,\nseveral important related problems can also be solved efficiently, e.g.,\n$(1+\\epsilon)$-approximate $k$th-nearest neighbor distance,\n$(4+\\epsilon)$-approximate Min-Max clustering, $(4+\\epsilon)$-approximate\n$k$-center clustering. 
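For reference alongside the r-nets abstract above, the sketch below builds an r-net by the straightforward greedy method, under which net points end up pairwise more than r apart while every input point lies within r of some net point. This is the quadratic-time baseline that the paper's approximation algorithms aim to beat in high dimensions; Euclidean points and NumPy are assumed:

```python
import numpy as np

def greedy_r_net(points, r):
    """Naive O(n^2) construction of an r-net: a point becomes a net point
    only if it is more than r away from every net point chosen so far."""
    pts = np.asarray(points, dtype=float)
    net = []
    for i, p in enumerate(pts):
        if all(np.linalg.norm(p - pts[j]) > r for j in net):
            net.append(i)
    return net

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((500, 20))          # 500 points in 20 dimensions
    centers = greedy_r_net(X, r=1.0)
    # Sanity check: every point has a net point within distance r.
    dists = np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=-1)
    print(len(centers), bool((dists.min(axis=1) <= 1.0).all()))
```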
In addition, we build an algorithm that\n$(1+\\epsilon)$-approximates greedy permutations in time $\\tilde{O}((dn +\nn^{2-\\alpha}) \\cdot \\log{\\Phi})$ where $\\Phi$ is the spread of the input. This\nalgorithm is used to $(2+\\epsilon)$-approximate $k$-center with the same time\ncomplexity.\n","authors":"Georgia Avarikioti|Alain Ryser|Yuyi Wang|Roger Wattenhofer","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.02288v1","link_pdf":"http://arxiv.org/pdf/1811.02288v1","link_doi":"","comment":"Accepted by AAAI2019","journal_ref":"","doi":"","primary_category":"cs.CG","categories":"cs.CG"} {"id":"1811.03578v5","submitted":"2018-11-08 17:59:50","updated":"2019-08-30 17:16:05","title":"The ASCCR Frame for Learning Essential Collaboration Skills","abstract":" Statistics and data science are especially collaborative disciplines that\ntypically require practitioners to interact with many different people or\ngroups. Consequently, interdisciplinary collaboration skills are part of the\npersonal and professional skills essential for success as an applied\nstatistician or data scientist. These skills are learnable and teachable, and\nlearning and improving collaboration skills provides a way to enhance one's\npractice of statistics and data science. To help individuals learn these skills\nand organizations to teach them, we have developed a framework covering five\nessential components of statistical collaboration: Attitude, Structure,\nContent, Communication, and Relationship. We call this the ASCCR Frame. This\nframework can be incorporated into formal training programs in the classroom or\non the job and can also be used by individuals through self-study. We show how\nthis framework can be applied specifically to statisticians and data scientists\nto improve their collaboration skills and their interdisciplinary impact. We\nbelieve that the ASCCR Frame can help organize and stimulate research and\nteaching in interdisciplinary collaboration and call on individuals and\norganizations to begin generating evidence regarding its effectiveness.\n","authors":"Eric A. Vance|Heather S. Smith","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.03578v5","link_pdf":"http://arxiv.org/pdf/1811.03578v5","link_doi":"","comment":"12 pages, 1 figure. Updated to this Version 5 by adding a few more\n references, discussing how to teach ASCCR in the classroom, calling on others\n to add to research supporting the use of the ASCCR Frame, and adding\n discussion of ethics and reproducible research","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT"} {"id":"1811.03911v1","submitted":"2018-11-09 14:12:03","updated":"2018-11-09 14:12:03","title":"Unsupervised Learnable Sinogram Inpainting Network (SIN) for Limited\n Angle CT reconstruction","abstract":" In this paper, we propose a sinogram inpainting network (SIN) to solve\nlimited-angle CT reconstruction problem, which is a very challenging ill-posed\nissue and of great interest for several clinical applications. A common\napproach to the problem is an iterative reconstruction algorithm with\nregularization term, which can suppress artifacts and improve image quality,\nbut requires high computational cost.\n The starting point of this paper is the proof of inpainting function for\nlimited-angle sinogram is continuous, which can be approached by neural\nnetworks in a data-driven method, granted by the universal approximation\ntheorem. 
Based on this, we propose SIN as the fitting function -- a\nconvolutional neural network trained to generate missing sinogram data\nconditioned on scanned data. Besides CNN module, we design two differentiable\nand rapid modules, Radon and Inverse Radon Transformer network, to encapsulate\nthe physical model in the training procedure. They enable new joint loss\nfunctions to optimize both sinogram and reconstructed image in sync, which\nimproved the image quality significantly. To tackle the labeled data bottleneck\nin clinical research, we form a sinogram-image-sinogram closed loop, and the\ndifference between sinograms can be used as training loss. In this way, the\nproposed network can be self-trained, with only limited-angle data for\nunsupervised domain adaptation.\n We demonstrate the performance of our proposed network on parallel beam X-ray\nCT in lung CT datasets from Data Science Bowl 2017 and the ability of\nunsupervised transfer learning in Zubal's phantom. The proposed method performs\nbetter than state-of-art method SART-TV in PSNR and SSIM metrics, with\nnoticeable visual improvements in reconstructions.\n","authors":"Ji Zhao|Zhiqiang Chen|Li Zhang|Xin Jin","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.03911v1","link_pdf":"http://arxiv.org/pdf/1811.03911v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"physics.med-ph","categories":"physics.med-ph|eess.IV"} {"id":"1811.03996v4","submitted":"2018-11-09 16:12:40","updated":"2020-03-26 09:51:05","title":"Uncertainty relations and sparse signal recovery","abstract":" This chapter provides a principled introduction to uncertainty relations\nunderlying sparse signal recovery. We start with the seminal work by Donoho and\nStark, 1989, which defines uncertainty relations as upper bounds on the\noperator norm of the band-limitation operator followed by the time-limitation\noperator, generalize this theory to arbitrary pairs of operators, and then\ndevelop -- out of this generalization -- the coherence-based uncertainty\nrelations due to Elad and Bruckstein, 2002, as well as uncertainty relations in\nterms of concentration of $1$-norm or $2$-norm. The theory is completed with\nthe recently discovered set-theoretic uncertainty relations which lead to best\npossible recovery thresholds in terms of a general measure of parsimony, namely\nMinkowski dimension. We also elaborate on the remarkable connection between\nuncertainty relations and the \"large sieve\", a family of inequalities developed\nin analytic number theory. It is finally shown how uncertainty relations allow\nto establish fundamental limits of practical signal recovery problems such as\ninpainting, declipping, super-resolution, and denoising of signals corrupted by\nimpulse noise or narrowband interference. Detailed proofs are provided\nthroughout the chapter.\n","authors":"Erwin Riegler|Helmut Bölcskei","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.03996v4","link_pdf":"http://arxiv.org/pdf/1811.03996v4","link_doi":"","comment":"Chapter in Information-theoretic Methods in Data Science, M.\n Rodrigues and Y. 
Eldar, Eds., Cambridge University Press, 2020","journal_ref":"","doi":"","primary_category":"cs.IT","categories":"cs.IT|math.IT"} {"id":"1811.05527v1","submitted":"2018-11-13 20:56:14","updated":"2018-11-13 20:56:14","title":"Semi-dual Regularized Optimal Transport","abstract":" Variational problems that involve Wasserstein distances and more generally\noptimal transport (OT) theory are playing an increasingly important role in\ndata sciences. Such problems can be used to form an exemplar measure out of\nvarious probability measures, as in the Wasserstein barycenter problem, or to\ncarry out parametric inference and density fitting, where the loss is measured\nin terms of an optimal transport cost to the measure of observations. Despite\nbeing conceptually simple, such problems are computationally challenging\nbecause they involve minimizing over quantities (Wasserstein distances) that\nare themselves hard to compute. Entropic regularization has recently emerged as\nan efficient tool to approximate the solution of such variational Wasserstein\nproblems. In this paper, we give a thorough duality tour of these\nregularization techniques. In particular, we show how important concepts from\nclassical OT such as c-transforms and semi-discrete approaches translate into\nsimilar ideas in a regularized setting. These dual formulations lead to smooth\nvariational problems, which can be solved using smooth, differentiable and\nconvex optimization problems that are simpler to implement and numerically more\nstable than their un-regularized counterparts. We illustrate the versatility of\nthis approach by applying it to the computation of Wasserstein barycenters and\ngradient flows of spatial regularization functionals.\n","authors":"Marco Cuturi|Gabriel Peyré","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.05527v1","link_pdf":"http://arxiv.org/pdf/1811.05527v1","link_doi":"http://dx.doi.org/10.1137/18M1208654","comment":"","journal_ref":"SIAM Review, 60(4), 941-965, 2018","doi":"10.1137/18M1208654","primary_category":"cs.LG","categories":"cs.LG|cs.AI|math.OC|stat.ML"} {"id":"1811.05663v1","submitted":"2018-11-14 06:43:07","updated":"2018-11-14 06:43:07","title":"Correlation between SFR surface density and thermal pressure of ionized\n gas in local analogs of high-redshift galaxies","abstract":" We explore the relation between the star formation rate surface density\n($\Sigma$SFR) and the interstellar gas pressure for nearby compact starburst\ngalaxies. The sample consists of 17 green peas and 19 Lyman break analogs.\nGreen peas are nearby analogs of Ly$\alpha$ emitters at high redshift and Lyman\nbreak analogs are nearby analogs of Lyman break galaxies at high redshift. We\nmeasure the sizes for green peas using Hubble Space Telescope Cosmic Origins\nSpectrograph (COS) NUV images with a spatial resolution of $\sim$ 0.05$^{''}$.\nWe estimate the gas thermal pressure in HII regions by $P = N_{total}Tk_{B}\n\simeq 2n_{e}Tk_{B}$. The electron density is derived using the [SII] doublet\nat 6716,6731 \AA, and the temperature is calculated from the [OIII] lines. The\ncorrelation is characterized by $\Sigma$SFR =\n2.40$\times$10$^{-3\,}$M$_{\odot\,}$yr$^{-1\,}$kpc$^{-2}$$\left(\frac{P/k_{B}}{10^{4}cm^{-3}K}\right)^{1.33}$.\nGreen peas and Lyman break analogs have high $\Sigma$SFR up to 1.2\nM$_{\odot\,}$yr$^{-1\,}$kpc$^{-2}$ and high thermal pressure in HII region up\nto P/k$_B$ $\sim$10$^{7.2}{\rm\, K\, cm}^{-3}$. 
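The abstract above quotes a fitted relation between star-formation-rate surface density and the HII-region thermal pressure P/k_B (itself estimated as roughly 2 n_e T). Purely as an illustration of evaluating that relation, with made-up input values:

```python
# Evaluate the fitted relation quoted in the abstract above:
# Sigma_SFR = 2.40e-3 Msun/yr/kpc^2 * ((P/k_B) / 1e4 cm^-3 K)^1.33,
# with the HII-region pressure estimated as P/k_B ~= 2 * n_e * T.
def sigma_sfr(p_over_kB):
    """Star-formation-rate surface density [Msun/yr/kpc^2] from P/k_B [cm^-3 K]."""
    return 2.40e-3 * (p_over_kB / 1e4) ** 1.33

n_e, T = 50.0, 1.0e4              # example electron density [cm^-3] and temperature [K]
p_over_kB = 2.0 * n_e * T         # ~ 1e6 cm^-3 K
print(f"P/k_B = {p_over_kB:.2e} cm^-3 K -> Sigma_SFR = {sigma_sfr(p_over_kB):.3f}")
```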
These values are at the highest\nend of the range seen in nearby starburst galaxies. The high gas pressure and\nthe correlation, are in agreement with those found in star-forming galaxies at\nz $\\sim$ 2.5. These extreme pressures are shown to be responsible for driving\ngalactic winds in nearby starbursts. These outflows may be a crucial in\nenabling Lyman-$\\alpha$ and Lyman-continuum to escape.\n","authors":"Tianxing Jiang|Sangeeta Malhotra|Huan Yang|James E. Rhoads","affiliations":"Arizona State University|NASA Goddard Space Flight Center|Las Campanas Observatory, Chile|NASA Goddard Space Flight Center","link_abstract":"http://arxiv.org/abs/1811.05663v1","link_pdf":"http://arxiv.org/pdf/1811.05663v1","link_doi":"http://dx.doi.org/10.3847/1538-4357/aaee79","comment":"16 pages, 6 figures, 1 table; Accepted for publication in ApJ","journal_ref":"","doi":"10.3847/1538-4357/aaee79","primary_category":"astro-ph.GA","categories":"astro-ph.GA"} {"id":"1811.05796v1","submitted":"2018-11-14 14:14:06","updated":"2018-11-14 14:14:06","title":"Direct T$_e$ metallicity calibration of R23 in strong line emitters","abstract":" The gas metallicity of galaxies is often estimated using strong emission\nlines such as the optical lines of [OIII] and [OII]. The most common measure is\n\"R23\", defined as ([OII]$\\lambda$$\\lambda$3726, 3729 +\n[OIII]$\\lambda$$\\lambda$4959,5007)/H$\\beta$. Most calibrations for these\nstrong-line metallicity indicators are for continuum selected galaxies. We\nreport a new empirical calibration of R23 for extreme emission-line galaxies\nusing a large sample of about 800 star-forming green pea galaxies with reliable\nT$_e$-based gas-phase metallicity measurements. This sample is assembled from\nSloan Digital Sky Survey (SDSS) Data Release 13 with the equivalent width of\nthe line [OIII]$\\lambda$5007 $>$ 300 \\AA\\ or the equivalent width of the line\nH$\\beta$ $>$ 100 \\AA\\ in the redshift range 0.011 $<$ z $<$ 0.411. For galaxies\nwith strong emission lines and large ionization parameter (which manifests as\nlog [OIII]$\\lambda$$\\lambda$4959,5007/[OII]$\\lambda$$\\lambda$3726,3729 $\\geq$\n0.6), R23 monotonically increases with log(O/H) and the double-value degeneracy\nis broken. Our calibration provides metallicity estimates that are accurate to\nwithin $\\sim$ 0.14 dex in this regime. Many previous R23 calibrations are found\nto have bias and large scatter for extreme emission-line galaxies. We give\nformulae and plots to directly convert R23 and\n[OIII]$\\lambda$$\\lambda$4959,5007/[OII]$\\lambda$$\\lambda$3726,3729 to log(O/H).\nSince green peas are best nearby analogs of high-redshift Lyman-$\\alpha$\nemitting galaxies, the new calibration offers a good way to estimate the\nmetallicities of both extreme emission-line galaxies and high-redshift\nLyman-$\\alpha$ emitting galaxies. We also report on 15 galaxies with\nmetallicities less than 1/12 solar, with the lowest metallicities being\n12+log(O/H) = 7.25 and 7.26.\n","authors":"Tianxing Jiang|Sangeeta Malhotra|James E. 
Rhoads|Huan Yang","affiliations":"Arizona State University|NASA Goddard Space Flight Center|NASA Goddard Space Flight Center|Las Campanas Observatory, Chile","link_abstract":"http://arxiv.org/abs/1811.05796v1","link_pdf":"http://arxiv.org/pdf/1811.05796v1","link_doi":"http://dx.doi.org/10.3847/1538-4357/aaee8a","comment":"17 pages, 10 figures; Accepted for publication in ApJ","journal_ref":"","doi":"10.3847/1538-4357/aaee8a","primary_category":"astro-ph.GA","categories":"astro-ph.GA"} {"id":"1811.06397v2","submitted":"2018-11-15 14:35:16","updated":"2019-03-30 16:04:50","title":"Offline Biases in Online Platforms: a Study of Diversity and Homophily\n in Airbnb","abstract":" How diverse are sharing economy platforms? Are they fair marketplaces, where\nall participants operate on a level playing field, or are they large-scale\nonline aggregators of offline human biases? Often portrayed as easy-to-access\ndigital spaces whose participants receive equal opportunities, such platforms\nhave recently come under fire due to reports of discriminatory behaviours among\ntheir users, and have been associated with gentrification phenomena that\nexacerbate preexisting inequalities along racial lines. In this paper, we focus\non the Airbnb sharing economy platform, and analyse the diversity of its user\nbase across five large cities. We find it to be predominantly young, female,\nand white. Notably, we find this to be true even in cities with a diverse\nracial composition. We then introduce a method based on the statistical\nanalysis of networks to quantify behaviours of homophily, heterophily and\navoidance between Airbnb hosts and guests. Depending on cities and property\ntypes, we do find signals of such behaviours relating both to race and gender.\nWe use these findings to provide platform design recommendations, aimed at\nexposing and possibly reducing the biases we detect, in support of a more\ninclusive growth of sharing economy platforms.\n","authors":"Victoria Koh|Weihua Li|Giacomo Livan|Licia Capra","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.06397v2","link_pdf":"http://arxiv.org/pdf/1811.06397v2","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-019-0189-5","comment":"17 pages, 1 figure","journal_ref":"EPJ Data Science 8:11 (2019)","doi":"10.1140/epjds/s13688-019-0189-5","primary_category":"cs.SI","categories":"cs.SI|cs.CY|physics.soc-ph"} {"id":"1811.06433v2","submitted":"2018-11-15 15:38:36","updated":"2020-02-12 12:32:13","title":"On a minimum distance procedure for threshold selection in tail analysis","abstract":" Power-law distributions have been widely observed in different areas of\nscientific research. Practical estimation issues include how to select a\nthreshold above which observations follow a power-law distribution and then how\nto estimate the power-law tail index. A minimum distance selection procedure\n(MDSP) is proposed in Clauset et al. (2009) and has been widely adopted in\npractice, especially in the analyses of social networks. However, theoretical\njustifications for this selection procedure remain scant. In this paper, we\nstudy the asymptotic behavior of the selected threshold and the corresponding\npower-law index given by the MDSP. We find that the MDSP tends to choose too\nhigh a threshold level and leads to Hill estimates with large variances and\nroot mean squared errors for simulated data with Pareto-like tails.\n","authors":"Holger Drees|Anja Janßen|Sidney I. 
Resnick|Tiandong Wang","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.06433v2","link_pdf":"http://arxiv.org/pdf/1811.06433v2","link_doi":"http://dx.doi.org/10.1137/19M1260463","comment":"","journal_ref":"SIAM Journal on Mathematics of Data Science 2020 2:1, 75-102","doi":"10.1137/19M1260463","primary_category":"math.ST","categories":"math.ST|stat.ME|stat.TH|60G70, 62E20, 60G15, 62G30, 05C80"} {"id":"1811.06446v1","submitted":"2018-11-15 16:07:49","updated":"2018-11-15 16:07:49","title":"Preliminary Studies on a Large Face Database","abstract":" We perform preliminary studies on a large longitudinal face database\nMORPH-II, which is a benchmark dataset in the field of computer vision and\npattern recognition. First, we summarize the inconsistencies in the dataset and\nintroduce the steps and strategy taken for cleaning. The potential implications\nof these inconsistencies on prior research are introduced. Next, we propose a\nnew automatic subsetting scheme for evaluation protocol. It is intended to\novercome the unbalanced racial and gender distributions of MORPH-II, while\nensuring independence between training and testing sets. Finally, we contribute\na novel global framework for age estimation that utilizes posterior\nprobabilities from the race classification step to compute a race-composite age\nestimate. Preliminary experimental results on MORPH-II are presented.\n","authors":"Benjamin Yip|Garrett Bingham|Katherine Kempfert|Jonathan Fabish|Troy Kling|Cuixian Chen|Yishi Wang","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.06446v1","link_pdf":"http://arxiv.org/pdf/1811.06446v1","link_doi":"","comment":"It has been accepted in the 5th National Symposium for NSF REU\n Research in Data Science, Systems, and Security. G. Bingham and K. Kempfert\n contributed equally","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1811.07344v1","submitted":"2018-11-18 16:11:52","updated":"2018-11-18 16:11:52","title":"Transfer Learning with Deep CNNs for Gender Recognition and Age\n Estimation","abstract":" In this project, competition-winning deep neural networks with pretrained\nweights are used for image-based gender recognition and age estimation.\nTransfer learning is explored using both VGG19 and VGGFace pretrained models by\ntesting the effects of changes in various design schemes and training\nparameters in order to improve prediction accuracy. Training techniques such as\ninput standardization, data augmentation, and label distribution age encoding\nare compared. Finally, a hierarchy of deep CNNs is tested that first classifies\nsubjects by gender, and then uses separate male and female age models to\npredict age. A gender recognition accuracy of 98.7% and an MAE of 4.1 years are\nachieved. 
This paper shows that, with proper training techniques, good results\ncan be obtained by retasking existing convolutional filters towards a new\npurpose.\n","authors":"Philip Smith|Cuixian Chen","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.07344v1","link_pdf":"http://arxiv.org/pdf/1811.07344v1","link_doi":"","comment":"It has been accepted in the 5th National Symposium for NSF REU\n Research in Data Science, Systems, and Security","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1811.07974v1","submitted":"2018-11-19 20:51:30","updated":"2018-11-19 20:51:30","title":"A Map of Knowledge","abstract":" Knowledge representation has gained in relevance as data from the ubiquitous\ndigitization of behaviors amass and academia and industry seek methods to\nunderstand and reason about the information they encode. Success in this\npursuit has emerged with data from natural language, where skip-grams and other\nlinear connectionist models of distributed representation have surfaced\nscrutable relational structures which have also served as artifacts of\nanthropological interest. Natural language is, however, only a fraction of the\nbig data deluge. Here we show that latent semantic structure, comprised of\nelements from digital records of our interactions, can be informed by\nbehavioral data and that domain knowledge can be extracted from this structure\nthrough visualization and a novel mapping of the literal descriptions of\nelements onto this behaviorally informed representation. We use the course\nenrollment behaviors of 124,000 students at a public university to learn vector\nrepresentations of its courses. From these behaviorally informed\nrepresentations, a notable 88% of course attribute information were recovered\n(e.g., department and division), as well as 40% of course relationships\nconstructed from prior domain knowledge and evaluated by analogy (e.g., Math 1B\nis to Math H1B as Physics 7B is to Physics H7B). To aid in interpretation of\nthe learned structure, we create a semantic interpolation, translating course\nvectors to a bag-of-words of their respective catalog descriptions. We find\nthat the representations learned from enrollments resolved course vectors to a\nlevel of semantic fidelity exceeding that of their catalog descriptions,\ndepicting a vector space of high conceptual rationality. We end with a\ndiscussion of the possible mechanisms by which this knowledge structure may be\ninformed and its implications for data science.\n","authors":"Zachary A. Pardos|Andrew Joo Hun Nam","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.07974v1","link_pdf":"http://arxiv.org/pdf/1811.07974v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1811.09496v2","submitted":"2018-11-23 14:36:23","updated":"2019-02-01 12:42:40","title":"The Error is the Feature: how to Forecast Lightning using a Model\n Prediction Error","abstract":" Despite the progress within the last decades, weather forecasting is still a\nchallenging and computationally expensive task. Current satellite-based\napproaches to predict thunderstorms are usually based on the analysis of the\nobserved brightness temperatures in different spectral channels and emit a\nwarning if a critical threshold is reached. Recent progress in data science\nhowever demonstrates that machine learning can be successfully applied to many\nresearch fields in science, especially in areas dealing with large datasets. 
We\ntherefore present a new approach to the problem of predicting thunderstorms\nbased on machine learning. The core idea of our work is to use the error of\ntwo-dimensional optical flow algorithms applied to images of meteorological\nsatellites as a feature for machine learning models. We interpret that optical\nflow error as an indication of convection potentially leading to thunderstorms\nand lightning. To factor in spatial proximity we use various manual convolution\nsteps. We also consider effects such as the time of day or the geographic\nlocation. We train different tree classifier models as well as a neural network\nto predict lightning within the next few hours (called nowcasting in\nmeteorology) based on these features. In our evaluation section we compare the\npredictive power of the different models and the impact of different features\non the classification result. Our results show a high accuracy of 96% for\npredictions over the next 15 minutes which slightly decreases with increasing\nforecast period but still remains above 83% for forecasts of up to five hours.\nThe high false positive rate of nearly 6% however needs further investigation\nto allow for an operational use of our approach.\n","authors":"Christian Schön|Jens Dittrich|Richard Müller","affiliations":"Saarland Informatics Campus|Saarland Informatics Campus|Deutscher Wetterdienst","link_abstract":"http://arxiv.org/abs/1811.09496v2","link_pdf":"http://arxiv.org/pdf/1811.09496v2","link_doi":"http://dx.doi.org/10.1145/3292500.3330682","comment":"10 pages, 7 figures","journal_ref":"","doi":"10.1145/3292500.3330682","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1811.10470v2","submitted":"2018-11-26 16:06:17","updated":"2018-12-21 14:38:44","title":"Analysis of large sparse graphs using regular decomposition of graph\n distance matrices","abstract":" Statistical analysis of large and sparse graphs is a challenging problem in\ndata science due to the high dimensionality and nonlinearity of the problem.\nThis paper presents a fast and scalable algorithm for partitioning such graphs\ninto disjoint groups based on observed graph distances from a set of reference\nnodes. The resulting partition provides a low-dimensional approximation of the\nfull distance matrix which helps to reveal global structural properties of the\ngraph using only small samples of the distance matrix. The presented algorithm\nis inspired by the information-theoretic minimum description principle. We\ninvestigate the performance of this algorithm for selected real data sets and\nfor synthetic graph data sets generated using stochastic block models and\npower-law random graphs, together with analytical considerations for sparse\nstochastic block models with bounded average degrees.\n","authors":"Hannu Reittu|Lasse Leskelä|Tomi Räty|Marco Fiorucci","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.10470v2","link_pdf":"http://arxiv.org/pdf/1811.10470v2","link_doi":"","comment":"IEEE BigData 2018 Conference Workshop, Advances in High Dimensional\n Big Data, 10.-13.12. 2018, Seattle USA","journal_ref":"","doi":"","primary_category":"cs.DS","categories":"cs.DS|cs.SI"} {"id":"1811.11242v1","submitted":"2018-11-27 20:26:33","updated":"2018-11-27 20:26:33","title":"Wrangling Messy CSV Files by Detecting Row and Type Patterns","abstract":" It is well known that data scientists spend the majority of their time on\npreparing data for analysis. 
One of the first steps in this preparation phase\nis to load the data from the raw storage format. Comma-separated value (CSV)\nfiles are a popular format for tabular data due to their simplicity and\nostensible ease of use. However, formatting standards for CSV files are not\nfollowed consistently, so each file requires manual inspection and potentially\nrepair before the data can be loaded, an enormous waste of human effort for a\ntask that should be one of the simplest parts of data science. The first and\nmost essential step in retrieving data from CSV files is deciding on the\ndialect of the file, such as the cell delimiter and quote character. Existing\ndialect detection approaches are few and non-robust. In this paper, we propose\na dialect detection method based on a novel measure of data consistency of\nparsed data files. Our method achieves 97% overall accuracy on a large corpus\nof real-world CSV files and improves the accuracy on messy CSV files by almost\n22% compared to existing approaches, including those in the Python standard\nlibrary.\n","authors":"Gerrit J. J. van den Burg|Alfredo Nazabal|Charles Sutton","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.11242v1","link_pdf":"http://arxiv.org/pdf/1811.11242v1","link_doi":"http://dx.doi.org/10.1007/s10618-019-00646-y","comment":"","journal_ref":"Data Mining and Knowledge Discovery (July, 2019)","doi":"10.1007/s10618-019-00646-y","primary_category":"cs.DB","categories":"cs.DB|E.5; H.2.8"} {"id":"1811.11620v1","submitted":"2018-11-28 15:19:45","updated":"2018-11-28 15:19:45","title":"Multi-step Time Series Forecasting Using Ridge Polynomial Neural Network\n with Error-Output Feedbacks","abstract":" Time series forecasting gets much attention due to its impact on many\npractical applications. A higher-order neural network with recurrent feedback is\na powerful technique which has been used successfully for forecasting. It maintains fast\nlearning and the ability to learn the dynamics of the series over time. For\nthat reason, in this paper, we propose a novel model called the Ridge Polynomial\nNeural Network with Error-Output Feedbacks (RPNN-EOFs), which combines the\nproperties of higher-order networks and error-output feedbacks. The well-known\nMackey-Glass time series is used to test the forecasting capability of\nRPNN-EOFs. Simulation results showed that the proposed RPNN-EOFs provides a\nbetter model of the Mackey-Glass time series, with a root mean square\nerror of 0.00416. This error is smaller than that of other models in the\nliterature. Therefore, we can conclude that the RPNN-EOFs can be applied\nsuccessfully to time series forecasting.\n","authors":"Waddah Waheeb|Rozaida Ghazali","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.11620v1","link_pdf":"http://arxiv.org/pdf/1811.11620v1","link_doi":"http://dx.doi.org/10.1007/978-981-10-2777-2_5","comment":"This is a pre-print of an article published in the International\n Conference on Soft Computing in Data Science, 2016. The final authenticated\n version is available online at:\n http://link.springer.com/chapter/10.1007/978-981-10-2777-2_5","journal_ref":"","doi":"10.1007/978-981-10-2777-2_5","primary_category":"cs.LG","categories":"cs.LG|cs.NE|stat.ML"} {"id":"1811.11960v1","submitted":"2018-11-29 04:34:48","updated":"2018-11-29 04:34:48","title":"Prediction Factory: automated development and collaborative evaluation\n of predictive models","abstract":" In this paper, we present a data science automation system called Prediction\nFactory. 
The system uses several key automation algorithms to enable data\nscientists to rapidly develop predictive models and share them with domain\nexperts. To assess the system's impact, we implemented 3 different interfaces\nfor creating predictive modeling projects: baseline automation, full\nautomation, and optional automation. With a dataset of online grocery shopper\nbehaviors, we divided data scientists among the interfaces to specify\nprediction problems, learn and evaluate models, and write a report for domain\nexperts to judge whether or not to fund continued work. In total, 22\ndata scientists created 94 reports that were judged 296 times by 26 experts. In\na head-to-head trial, reports generated using the full automation\ninterface were funded 57.5% of the time, while the ones that used\nbaseline automation were only funded 42.5% of the time. Reports generated with the intermediate\ninterface, which supports optional automation, were funded\n58.6% more often than those from the baseline. Full automation and optional\nautomation reports were funded about equally when put head-to-head. These\nresults demonstrate that Prediction Factory has implemented a critical amount\nof automation to augment the role of data scientists and improve business\noutcomes.\n","authors":"Gaurav Sheni|Benjamin Schreck|Roy Wedge|James Max Kanter|Kalyan Veeramachaneni","affiliations":"","link_abstract":"http://arxiv.org/abs/1811.11960v1","link_pdf":"http://arxiv.org/pdf/1811.11960v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1812.01106v1","submitted":"2018-11-30 18:51:16","updated":"2018-11-30 18:51:16","title":"Improving Traffic Safety Through Video Analysis in Jakarta, Indonesia","abstract":" This project presents the results of a partnership between the Data Science\nfor Social Good fellowship, Jakarta Smart City and Pulse Lab Jakarta to create\na video analysis pipeline for the purpose of improving traffic safety in\nJakarta. The pipeline transforms raw traffic video footage into databases that\nare ready to be used for traffic analysis. By analyzing these patterns, the\ncity of Jakarta will better understand how human behavior and built\ninfrastructure contribute to traffic challenges and safety risks. 
The results\nof this work should also be broadly applicable to smart city initiatives around\nthe globe as they improve urban planning and sustainability through data\nscience approaches.\n","authors":"João Caldeira|Alex Fout|Aniket Kesari|Raesetje Sefala|Joseph Walsh|Katy Dupre|Muhammad Rizal Khaefi| Setiaji|George Hodge|Zakiya Aryana Pramestri|Muhammad Adib Imtiyazi","affiliations":"","link_abstract":"http://arxiv.org/abs/1812.01106v1","link_pdf":"http://arxiv.org/pdf/1812.01106v1","link_doi":"http://dx.doi.org/10.1007/978-3-030-29513-4_48","comment":"6 pages; LaTeX; Presented at NeurIPS 2018 Workshop on Machine\n Learning for the Developing World; Presented at NeurIPS 2018 Workshop on AI\n for Social Good","journal_ref":"Proceedings of the 2019 Intelligent Systems Conference\n (IntelliSys) Volume 2, 642-649","doi":"10.1007/978-3-030-29513-4_48","primary_category":"cs.CY","categories":"cs.CY|cs.CV|cs.LG|stat.ML"} {"id":"1812.01077v1","submitted":"2018-12-03 20:46:27","updated":"2018-12-03 20:46:27","title":"Brief survey of Mobility Analyses based on Mobile Phone Datasets","abstract":" This is a brief survey of the research performed by Grandata Labs in\ncollaboration with numerous academic groups around the world on the topic of\nhuman mobility. A driving theme in these projects is to use and improve Data\nScience techniques to understand mobility, as it can be observed through the\nlens of mobile phone datasets. We describe applications of mobility analyses\nfor urban planning, prediction of data traffic usage, building delay tolerant\nnetworks, generating epidemiologic risk maps and measuring the predictability\nof human mobility.\n","authors":"Carlos Sarraute|Martin Minnoni","affiliations":"","link_abstract":"http://arxiv.org/abs/1812.01077v1","link_pdf":"http://arxiv.org/pdf/1812.01077v1","link_doi":"","comment":"Workshop on Urban Computing and Society. Petropolis, RJ, Brazil. Nov\n 28, 2018","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI|cs.CY|cs.LG|stat.ML"} {"id":"1812.02893v1","submitted":"2018-12-07 03:45:27","updated":"2018-12-07 03:45:27","title":"The Calabi-Yau Landscape: from Geometry, to Physics, to Machine-Learning","abstract":" We present a pedagogical introduction to the recent advances in the\ncomputational geometry, physical implications, and data science of Calabi-Yau\nmanifolds. Aimed at the beginning research student and using Calabi-Yau spaces\nas an exciting play-ground, we intend to teach some mathematics to the budding\nphysicist, some physics to the budding mathematician, and some machine-learning\nto both. Based on various lecture series, colloquia and seminars given by the\nauthor in the past year, this writing is a very preliminary draft of a book to\nappear with Springer, by whose kind permission we post to ArXiv for comments\nand suggestions.\n","authors":"Yang-Hui He","affiliations":"","link_abstract":"http://arxiv.org/abs/1812.02893v1","link_pdf":"http://arxiv.org/pdf/1812.02893v1","link_doi":"","comment":"#pages = #irreps(Monster), 44 figures, book to appear with Springer","journal_ref":"","doi":"","primary_category":"hep-th","categories":"hep-th|math-ph|math.AG|math.MP|stat.ML"} {"id":"1812.03350v2","submitted":"2018-12-08 16:49:32","updated":"2018-12-19 16:47:21","title":"Adaptive and Calibrated Ensemble Learning with Dependent Tail-free\n Process","abstract":" Ensemble learning is a mainstay in modern data science practice. 
Conventional\nensemble algorithms assign to base models a set of deterministic, constant\nmodel weights that (1) do not fully account for variations in base model\naccuracy across subgroups, nor (2) provide uncertainty estimates for the\nensemble prediction, which could result in mis-calibrated (i.e. precise but\nbiased) predictions that could in turn negatively impact the algorithm's\nperformance in real-world applications. In this work, we present an adaptive,\nprobabilistic approach to ensemble learning using a dependent tail-free process\nas the ensemble weight prior. Given input feature $\mathbf{x}$, our method\noptimally combines base models based on their predictive accuracy in the\nfeature space $\mathbf{x} \in \mathcal{X}$, and provides interpretable\nuncertainty estimates both in model selection and in ensemble prediction. To\nencourage scalable and calibrated inference, we derive a structured variational\ninference algorithm that jointly minimizes the KL objective and the model's\ncalibration score (i.e. Continuous Ranked Probability Score (CRPS)). We\nillustrate the utility of our method on both a synthetic nonlinear function\nregression task, and on the real-world application of spatio-temporal\nintegration of particle pollution prediction models in New England.\n","authors":"Jeremiah Zhe Liu|John Paisley|Marianthi-Anna Kioumourtzoglou|Brent A. Coull","affiliations":"","link_abstract":"http://arxiv.org/abs/1812.03350v2","link_pdf":"http://arxiv.org/pdf/1812.03350v2","link_doi":"","comment":"Work-in-progress manuscript appeared at Bayesian Nonparametrics\n Workshop, Neural Information Processing Systems 2018","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1812.03953v2","submitted":"2018-12-10 18:08:18","updated":"2019-02-20 22:57:02","title":"An Intelligent Safety System for Human-Centered Semi-Autonomous Vehicles","abstract":" Nowadays, automobile manufacturers make efforts to develop ways to make cars\nfully safe. Monitoring the driver's actions by computer vision techniques to detect\ndriving mistakes in real-time and then planning for autonomous driving to avoid\nvehicle collisions is one of the most important issues that has been\ninvestigated in the machine vision and Intelligent Transportation Systems\n(ITS). The main goal of this study is to prevent accidents caused by fatigue,\ndrowsiness, and driver distraction. To avoid these incidents, this paper\nproposes an integrated safety system that continuously monitors the driver's\nattention and vehicle surroundings, and finally decides whether the actual\nsteering control status is safe or not. For this purpose, we equipped an\nordinary car called FARAZ with a vision system consisting of four mounted\ncameras along with a universal car tool for communicating with surrounding\nfactory-installed sensors and other car systems, and sending commands to\nactuators. The proposed system leverages a scene understanding pipeline using\ndeep convolutional encoder-decoder networks and a driver state detection\npipeline. 
We have been identifying and assessing domestic capabilities for the\ndevelopment of technologies specifically of the ordinary vehicles in order to\nmanufacture smart cars and eke providing an intelligent system to increase\nsafety and to assist the driver in various conditions/situations.\n","authors":"Hadi Abdi Khojasteh|Alireza Abbas Alipour|Ebrahim Ansari|Parvin Razzaghi","affiliations":"","link_abstract":"http://arxiv.org/abs/1812.03953v2","link_pdf":"http://arxiv.org/pdf/1812.03953v2","link_doi":"http://dx.doi.org/10.1007/978-3-030-37309-2_26","comment":"15 pages and 5 figures, Submitted to the international conference on\n Contemporary issues in Data Science (CiDaS 2019), Learn more about this\n project at https://iasbs.ac.ir/~ansari/faraz","journal_ref":"Nature Switzerland AG - Springer LNDECT 45(2020) 322-336","doi":"10.1007/978-3-030-37309-2_26","primary_category":"cs.CV","categories":"cs.CV|cs.LG|cs.RO"} {"id":"1812.06007v1","submitted":"2018-12-14 16:22:18","updated":"2018-12-14 16:22:18","title":"The PowerURV algorithm for computing rank-revealing full factorizations","abstract":" Many applications in scientific computing and data science require the\ncomputation of a rank-revealing factorization of a large matrix. In many of\nthese instances the classical algorithms for computing the singular value\ndecomposition are prohibitively computationally expensive. The randomized\nsingular value decomposition can often be helpful, but is not effective unless\nthe numerical rank of the matrix is substantially smaller than the dimensions\nof the matrix. We introduce a new randomized algorithm for producing\nrank-revealing factorizations based on existing work by Demmel, Dumitriu and\nHoltz [Numerische Mathematik, 108(1), 2007] that excels in this regime. The\nmethod is exceptionally easy to implement, and results in close-to optimal\nlow-rank approximations to a given matrix. The vast majority of floating point\noperations are executed in level-3 BLAS, which leads to high computational\nspeeds. The performance of the method is illustrated via several numerical\nexperiments that directly compare it to alternative techniques such as the\ncolumn pivoted QR factorization, or the QLP method by Stewart.\n","authors":"Abinand Gopal|Per-Gunnar Martinsson","affiliations":"","link_abstract":"http://arxiv.org/abs/1812.06007v1","link_pdf":"http://arxiv.org/pdf/1812.06007v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.NA","categories":"math.NA|cs.NA"} {"id":"1812.06070v2","submitted":"2018-12-14 18:36:06","updated":"2019-07-23 12:48:54","title":"The Boosted DC Algorithm for nonsmooth functions","abstract":" The Boosted Difference of Convex functions Algorithm (BDCA) was recently\nproposed for minimizing smooth difference of convex (DC) functions. BDCA\naccelerates the convergence of the classical Difference of Convex functions\nAlgorithm (DCA) thanks to an additional line search step. The purpose of this\npaper is twofold. Firstly, to show that this scheme can be generalized and\nsuccessfully applied to certain types of nonsmooth DC functions, namely, those\nthat can be expressed as the difference of a smooth function and a possibly\nnonsmooth one. Secondly, to show that there is complete freedom in the choice\nof the trial step size for the line search, which is something that can further\nimprove its performance. 
We prove that any limit point of the BDCA iterative\nsequence is a critical point of the problem under consideration, and that the\ncorresponding objective value is monotonically decreasing and convergent. The\nglobal convergence and convergent rate of the iterations are obtained under the\nKurdyka-Lojasiewicz property. Applications and numerical experiments for two\nproblems in data science are presented, demonstrating that BDCA outperforms\nDCA. Specifically, for the Minimum Sum-of-Squares Clustering problem, BDCA was\non average sixteen times faster than DCA, and for the Multidimensional Scaling\nproblem, BDCA was three times faster than DCA.\n","authors":"Francisco J. Aragón Artacho|Phan T. Vuong","affiliations":"","link_abstract":"http://arxiv.org/abs/1812.06070v2","link_pdf":"http://arxiv.org/pdf/1812.06070v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.OC","categories":"math.OC|cs.LG|65K05, 65K10, 90C26, 47N10"} {"id":"1812.08032v2","submitted":"2018-12-19 15:45:03","updated":"2019-09-12 17:02:46","title":"Progressive Data Science: Potential and Challenges","abstract":" Data science requires time-consuming iterative manual activities. In\nparticular, activities such as data selection, preprocessing, transformation,\nand mining, highly depend on iterative trial-and-error processes that could be\nsped-up significantly by providing quick feedback on the impact of changes. The\nidea of progressive data science is to compute the results of changes in a\nprogressive manner, returning a first approximation of results quickly and\nallow iterative refinements until converging to a final result. Enabling the\nuser to interact with the intermediate results allows an early detection of\nerroneous or suboptimal choices, the guided definition of modifications to the\npipeline and their quick assessment. In this paper, we discuss the\nprogressiveness challenges arising in different steps of the data science\npipeline. We describe how changes in each step of the pipeline impact the\nsubsequent steps and outline why progressive data science will help to make the\nprocess more effective. Computing progressive approximations of outcomes\nresulting from changes creates numerous research challenges, especially if the\nchanges are made in the early steps of the pipeline. We discuss these\nchallenges and outline first steps towards progressiveness, which, we argue,\nwill ultimately help to significantly speed-up the overall data science\nprocess.\n","authors":"Cagatay Turkay|Nicola Pezzotti|Carsten Binnig|Hendrik Strobelt|Barbara Hammer|Daniel A. Keim|Jean-Daniel Fekete|Themis Palpanas|Yunhai Wang|Florin Rusu","affiliations":"","link_abstract":"http://arxiv.org/abs/1812.08032v2","link_pdf":"http://arxiv.org/pdf/1812.08032v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.HC","categories":"cs.HC|cs.DB|cs.LG|H.5.2; H.3.m; I.2.m; I.3.m"} {"id":"1901.02704v1","submitted":"2018-12-21 11:37:23","updated":"2018-12-21 11:37:23","title":"Cluster Lifecycle Analysis: Challenges, Techniques, and Framework","abstract":" Novel forms of data analysis methods have emerged as a significant research\ndirection in the transportation domain. These methods can potentially help to\nimprove our understanding of the dynamic flows of vehicles, people, and goods.\nUnderstanding these dynamics has economic and social consequences, which can\nimprove the quality of life locally or worldwide. 
Aiming at this objective, a\nsignificant amount of research has focused on clustering moving objects to\naddress problems in many domains, including the transportation, health, and\nenvironment. However, previous research has not investigated the lifecycle of a\ncluster, including cluster genesis, existence, and disappearance. The\nrepresentation and analysis of cluster lifecycles can create novel avenues for\nresearch, result in new insights for analyses, and allow unique forms of\nprediction. This technical report focuses on studying the lifecycle of clusters\nby investigating the relations that a cluster has with moving elements and\nother clusters. This technical report also proposes a big data framework that\nmanages the identification and processing of a cluster lifecycle. The ongoing\nresearch approach will lead to new ways to perform cluster analysis and advance\nthe state of the art by leading to new insights related to cluster lifecycle.\nThese results can have a significant impact on transport industry data science\napplications in a wide variety of areas, including congestion management,\nresource optimization, and hotspot management.\n","authors":"Ivens Portugal|Paulo Alencar|Donald Cowan","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.02704v1","link_pdf":"http://arxiv.org/pdf/1901.02704v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1812.10042v1","submitted":"2018-12-25 05:58:14","updated":"2018-12-25 05:58:14","title":"On discrimination between the Lindley and xgamma distributions","abstract":" For a given data set the problem of selecting either Lindley or xgamma\ndistribution with unknown parameter is investigated in this article. Both these\ndistributions can be used quite effectively for analyzing skewed non-negative\ndata and in modeling time-to-event data sets. We have used the ratio of the\nmaximized likelihoods in choosing between the Lindley and xgamma distributions.\nAsymptotic distributions of the ratio of the maximized likelihoods are obtained\nand those are utilized to determine the minimum sample size required to\ndiscriminate between these two distributions for user specified probability of\ncorrect selection and tolerance limit.\n","authors":"Subhradev Sen|Hazem Al-Mofleh|Sudhansu S. Maiti","affiliations":"","link_abstract":"http://arxiv.org/abs/1812.10042v1","link_pdf":"http://arxiv.org/pdf/1812.10042v1","link_doi":"http://dx.doi.org/10.1007/s40745-020-00243-7","comment":"17 pages, Communicated Article","journal_ref":"Annals of Data Science, 2020","doi":"10.1007/s40745-020-00243-7","primary_category":"stat.ME","categories":"stat.ME|60E05, 62E20, 62F03"} {"id":"1812.10176v1","submitted":"2018-12-25 23:44:07","updated":"2018-12-25 23:44:07","title":"A Variability-Aware Design Approach to the Data Analysis Modeling\n Process","abstract":" The massive amount of current data has led to many different forms of data\nanalysis processes that aim to explore this data to uncover valuable insights.\nMethodologies to guide the development of big data science projects, including\nCRISP-DM and SEMMA, have been widely used in industry and academia. The data\nanalysis modeling phase, which involves decisions on the most appropriate\nmodels to adopt, is at the core of these projects. However, from a software\nengineering perspective, the design and automation of activities performed in\nthis phase are challenging. 
In this paper, we propose an approach to the data\nanalysis modeling process which involves (i) the assessment of the variability\ninherent in the CRISP-DM data analysis modeling phase and the provision of\nfeature models that represent this variability; (ii) the definition of a\nframework structural design that captures the identified variability; and (iii)\nevaluation of the developed framework design in terms of the possibilities for\nprocess automation. The proposed approach advances the state of the art by\noffering a variability-aware design solution that can enhance system\nflexibility, potentially leading to novel software frameworks which can\nsignificantly improve the level of automation in data analysis modeling\nprocess.\n","authors":"Maria Cristina Vale Tavares|Paulo Alencar|Donald Cowan","affiliations":"","link_abstract":"http://arxiv.org/abs/1812.10176v1","link_pdf":"http://arxiv.org/pdf/1812.10176v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.SE","categories":"cs.SE"} {"id":"1812.10575v2","submitted":"2018-12-26 23:44:28","updated":"2019-09-24 04:13:27","title":"Random batch methods (RBM) for interacting particle systems","abstract":" We develop Random Batch Methods for interacting particle systems with large\nnumber of particles. These methods use small but random batches for particle\ninteractions, thus the computational cost is reduced from $O(N^2)$ per time\nstep to $O(N)$, for a system with $N$ particles with binary interactions. On\none hand, these methods are efficient Asymptotic-Preserving schemes for the\nunderlying particle systems, allowing $N$-independent time steps and also\ncapture, in the $N \\to \\infty$ limit, the solution of the mean field limit\nwhich are nonlinear Fokker-Planck equations; on the other hand, the stochastic\nprocesses generated by the algorithms can also be regarded as new models for\nthe underlying problems. For one of the methods, we give a particle number\nindependent error estimate under some special interactions. Then, we apply\nthese methods to some representative problems in mathematics, physics, social\nand data sciences, including the Dyson Brownian motion from random matrix\ntheory, Thomson's problem, distribution of wealth, opinion dynamics and\nclustering. Numerical results show that the methods can capture both the\ntransient solutions and the global equilibrium in these problems.\n","authors":"Shi Jin|Lei Li|Jian-Guo Liu","affiliations":"","link_abstract":"http://arxiv.org/abs/1812.10575v2","link_pdf":"http://arxiv.org/pdf/1812.10575v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.NA","categories":"math.NA|cs.NA|math.DS|math.PR|37M25, 65C20, 60H30"} {"id":"1901.00555v3","submitted":"2019-01-02 23:56:10","updated":"2019-11-25 05:34:42","title":"An Introductory Guide to Fano's Inequality with Applications in\n Statistical Estimation","abstract":" Information theory plays an indispensable role in the development of\nalgorithm-independent impossibility results, both for communication problems\nand for seemingly distinct areas such as statistics and machine learning. While\nnumerous information-theoretic tools have been proposed for this purpose, the\noldest one remains arguably the most versatile and widespread: Fano's\ninequality. In this chapter, we provide a survey of Fano's inequality and its\nvariants in the context of statistical estimation, adopting a versatile\nframework that covers a wide range of specific problems. 
We present a variety\nof key tools and techniques used for establishing impossibility results via\nthis approach, and provide representative examples covering group testing,\ngraphical model selection, sparse linear regression, density estimation, and\nconvex optimization.\n","authors":"Jonathan Scarlett|Volkan Cevher","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.00555v3","link_pdf":"http://arxiv.org/pdf/1901.00555v3","link_doi":"","comment":"Chapter in upcoming book \"Information-Theoretic Methods in Data\n Science\" (Cambridge University Press) edited by Yonina Eldar and Miguel\n Rodrigues. (v2 & v3) Minor corrections and edits","journal_ref":"","doi":"","primary_category":"cs.IT","categories":"cs.IT|cs.LG|math.IT|math.ST|stat.ML|stat.TH"} {"id":"1901.00630v1","submitted":"2019-01-03 06:47:37","updated":"2019-01-03 06:47:37","title":"Projecting \"better than randomly\": How to reduce the dimensionality of\n very large datasets in a way that outperforms random projections","abstract":" For very large datasets, random projections (RP) have become the tool of\nchoice for dimensionality reduction. This is due to the computational\ncomplexity of principal component analysis. However, the recent development of\nrandomized principal component analysis (RPCA) has opened up the possibility of\nobtaining approximate principal components on very large datasets. In this\npaper, we compare the performance of RPCA and RP in dimensionality reduction\nfor supervised learning. In Experiment 1, we study a malware classification task\non a dataset with over 10 million samples, almost 100,000 features, and over 25\nbillion non-zero values, with the goal of reducing the dimensionality to a\ncompressed representation of 5,000 features. In order to apply RPCA to this\ndataset, we develop a new algorithm called large sample RPCA (LS-RPCA), which\nextends the RPCA algorithm to work on datasets with arbitrarily many samples.\nWe find that classification performance is much higher when using LS-RPCA for\ndimensionality reduction than when using random projections. In particular,\nacross a range of target dimensionalities, we find that using LS-RPCA reduces\nclassification error by between 37% and 54%. Experiment 2 generalizes the\nphenomenon to multiple datasets, feature representations, and classifiers.\nThese findings have implications for a large number of research projects in\nwhich random projections were used as a preprocessing step for dimensionality\nreduction. As long as accuracy is at a premium and the target dimensionality is\nsufficiently less than the numeric rank of the dataset, randomized PCA may be a\nsuperior choice. Moreover, if the dataset has a large number of samples, then\nLS-RPCA will provide a method for obtaining the approximate principal\ncomponents.\n","authors":"Michael Wojnowicz|Di Zhang|Glenn Chisholm|Xuan Zhao|Matt Wolff","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.00630v1","link_pdf":"http://arxiv.org/pdf/1901.00630v1","link_doi":"","comment":"Originally published in IEEE DSAA in 2016; this post-print fixes a\n rendering error of the += operator in Algorithm 3","journal_ref":"2016 IEEE 3rd International Conference on Data Science and\n Advanced Analytics (DSAA) (pp. 184-193). 
IEEE","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1901.02547v1","submitted":"2019-01-08 22:56:45","updated":"2019-01-08 22:56:45","title":"Problem Formulation and Fairness","abstract":" Formulating data science problems is an uncertain and difficult process. It\nrequires various forms of discretionary work to translate high-level objectives\nor strategic goals into tractable problems, necessitating, among other things,\nthe identification of appropriate target variables and proxies. While these\nchoices are rarely self-evident, normative assessments of data science projects\noften take them for granted, even though different translations can raise\nprofoundly different ethical concerns. Whether we consider a data science\nproject fair often has as much to do with the formulation of the problem as any\nproperty of the resulting model. Building on six months of ethnographic\nfieldwork with a corporate data science team---and channeling ideas from\nsociology and history of science, critical data studies, and early writing on\nknowledge discovery in databases---we describe the complex set of actors and\nactivities involved in problem formulation. Our research demonstrates that the\nspecification and operationalization of the problem are always negotiated and\nelastic, and rarely worked out with explicit normative considerations in mind.\nIn so doing, we show that careful accounts of everyday data science work can\nhelp us better understand how and why data science problems are posed in\ncertain ways---and why specific formulations prevail in practice, even in the\nface of what might seem like normatively preferable alternatives. We conclude\nby discussing the implications of our findings, arguing that effective\nnormative interventions will require attending to the practical work of problem\nformulation.\n","authors":"Samir Passi|Solon Barocas","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.02547v1","link_pdf":"http://arxiv.org/pdf/1901.02547v1","link_doi":"http://dx.doi.org/10.1145/3287560.3287567","comment":"Conference on Fairness, Accountability, and Transparency (FAT* '19),\n January 29-31, 2019, Atlanta, GA, USA","journal_ref":"","doi":"10.1145/3287560.3287567","primary_category":"cs.CY","categories":"cs.CY"} {"id":"1901.03025v2","submitted":"2019-01-10 06:01:18","updated":"2019-11-21 07:28:01","title":"System-of-Systems Modeling, Analysis and Optimization of Hybrid\n Vehicular Traffic","abstract":" While the development of fully autonomous vehicles is one of the major\nresearch fields in the Intelligent Transportation Systems (ITSs) domain, the\nupcoming longterm transition period - the hybrid vehicular traffic - is often\nneglected. However, within the next decades, automotive systems with\nheterogeneous autonomy levels will share the same road network, resulting in\nnew problems for traffic management systems and communication network\ninfrastructure providers. In this paper, we identify key challenges of the\nupcoming hybrid traffic scenario and present a system-of-systems model, which\nbrings together approaches and methods from traffic modeling, data science, and\ncommunication engineering in order to allow data-driven traffic flow\noptimization. The proposed model consists of data acquisition, data transfer,\ndata analysis, and data exploitation and exploits real world sensor data as\nwell as simulative optimization methods. 
Based on the results of multiple case\nstudies, which focus on individual challenges (e.g., resource-efficient data\ntransfer and dynamic routing of vehicles), we point out approaches for using\nthe existing infrastructure with a higher grade of efficiency.\n","authors":"Benjamin Sliwa|Thomas Liebig|Tim Vranken|Michael Schreckenberg|Christian Wietfeld","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.03025v2","link_pdf":"http://arxiv.org/pdf/1901.03025v2","link_doi":"http://dx.doi.org/10.1109/SYSCON.2019.8836786","comment":"","journal_ref":"2019 IEEE International Systems Conference (SysCon)","doi":"10.1109/SYSCON.2019.8836786","primary_category":"cs.NI","categories":"cs.NI"} {"id":"1901.03319v2","submitted":"2019-01-10 18:52:02","updated":"2019-06-13 16:16:11","title":"Skeletonisation Algorithms with Theoretical Guarantees for Unorganised\n Point Clouds with High Levels of Noise","abstract":" Data Science aims to extract meaningful knowledge from unorganised data. Real\ndatasets usually come in the form of a cloud of points with only pairwise\ndistances. Numerous applications require visualising the overall shape of a\nnoisy cloud of points sampled from a non-linear object that is more complicated\nthan a union of disjoint clusters. The skeletonisation problem in its hardest\nform is to find a 1-dimensional skeleton that correctly represents a shape of\nthe cloud. This paper compares several algorithms that solve the above\nskeletonisation problem for any point cloud and guarantee a successful\nreconstruction. For example, given a highly noisy point sample of an unknown\nunderlying graph, a reconstructed skeleton should be geometrically close and\nhomotopy equivalent to (has the same number of independent cycles as) the\nunderlying graph. One of these algorithms produces a Homologically Persistent\nSkeleton (HoPeS) for any cloud without extra parameters. This universal\nskeleton contains sub-graphs that provably represent the 1-dimensional shape of\nthe cloud at any scale. Other subgraphs of HoPeS reconstruct an unknown graph\nfrom its noisy point sample with a correct homotopy type and within a small\noffset of the sample. The extensive experiments on synthetic and real data\nreveal for the first time the maximum level of noise that allows successful\ngraph reconstructions.\n","authors":"Vitaliy Kurlin|Philip Smith","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.03319v2","link_pdf":"http://arxiv.org/pdf/1901.03319v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CG","categories":"cs.CG"} {"id":"1901.03678v1","submitted":"2019-01-11 18:06:05","updated":"2019-01-11 18:06:05","title":"Machine Learning Automation Toolbox (MLaut)","abstract":" In this paper we present MLaut (Machine Learning AUtomation Toolbox) for the\npython data science ecosystem. MLaut automates large-scale evaluation and\nbenchmarking of machine learning algorithms on a large number of datasets.\nMLaut provides a high-level workflow interface to machine learning algorithms,\nimplements a local back-end to a database of dataset collections, trained\nalgorithms, and experimental results, and provides easy-to-use interfaces to\nthe scikit-learn and keras modelling libraries. 
Experiments are easy to set up\nwith default settings in a few lines of code, while remaining fully\ncustomizable to the level of hyper-parameter tuning, pipeline composition, or\ndeep learning architecture.\n As a principal test case for MLaut, we conducted a large-scale supervised\nclassification study in order to benchmark the performance of a number of\nmachine learning algorithms - to our knowledge also the first larger-scale\nstudy on standard supervised learning data sets to include deep learning\nalgorithms. While corroborating a number of previous findings in literature, we\nfound (within the limitations of our study) that deep neural networks do not\nperform well on basic supervised learning, i.e., outside the more specialized,\nimage-, audio-, or text-based tasks.\n","authors":"Viktor Kazakov|Franz J. Király","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.03678v1","link_pdf":"http://arxiv.org/pdf/1901.03678v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.AI|stat.ML"} {"id":"1901.04824v1","submitted":"2019-01-14 16:13:27","updated":"2019-01-14 16:13:27","title":"Approaching Ethical Guidelines for Data Scientists","abstract":" The goal of this article is to inspire data scientists to participate in the\ndebate on the impact that their professional work has on society, and to become\nactive in public debates on the digital world as data science professionals.\nHow do ethical principles (e.g., fairness, justice, beneficence, and\nnon-maleficence) relate to our professional lives? What lies in our\nresponsibility as professionals by our expertise in the field? More\nspecifically this article makes an appeal to statisticians to join that debate,\nand to be part of the community that establishes data science as a proper\nprofession in the sense of Airaksinen, a philosopher working on professional\nethics. As we will argue, data science has one of its roots in statistics and\nextends beyond it. To shape the future of statistics, and to take\nresponsibility for the statistical contributions to data science, statisticians\nshould actively engage in the discussions. First the term data science is\ndefined, and the technical changes that have led to a strong influence of data\nscience on society are outlined. Next the systematic approach from CNIL is\nintroduced. Prominent examples are given for ethical issues arising from the\nwork of data scientists. Further we provide reasons why data scientists should\nengage in shaping morality around and to formulate codes of conduct and codes\nof practice for data science. Next we present established ethical guidelines\nfor the related fields of statistics and computing machinery. Thereafter\nnecessary steps in the community to develop professional ethics for data\nscience are described. Finally we give our starting statement for the debate:\nData science is in the focal point of current societal development. 
Without\nbecoming a profession with professional ethics, data science will fail in\nbuilding trust in its interaction with and its much needed contributions to\nsociety!\n","authors":"Ursula Garzcarek|Detlef Steuer","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.04824v1","link_pdf":"http://arxiv.org/pdf/1901.04824v1","link_doi":"","comment":"18 pages, submitted Nov 12th 2018","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT|cs.AI|cs.CY|stat.ML|62A01"} {"id":"1901.04542v1","submitted":"2019-01-14 20:00:45","updated":"2019-01-14 20:00:45","title":"BoostNet: Bootstrapping detection of socialbots, and a case study from\n Guatemala","abstract":" We present a method to reconstruct networks of socialbots given minimal\ninput. Then we use Kernel Density Estimates of Botometer scores from 47,000\nsocial networking accounts to find clusters of automated accounts, discovering\nover 5,000 socialbots. This statistical and data driven approach allows for\ninference of thresholds for socialbot detection, as illustrated in a case study\nwe present from Guatemala.\n","authors":"E. I. Velazquez Richards|E. Gallagher|P. Suárez-Serrato","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.04542v1","link_pdf":"http://arxiv.org/pdf/1901.04542v1","link_doi":"http://dx.doi.org/10.1007/978-3-030-31551-1_11","comment":"7 pages, 4 figures","journal_ref":"Selected Contributions on Statistics and Data Science in Latin\n America. FNE 2018. Springer Proceedings in Mathematics & Statistics, vol 301","doi":"10.1007/978-3-030-31551-1_11","primary_category":"cs.CY","categories":"cs.CY|cs.SI"} {"id":"1901.04828v1","submitted":"2019-01-15 14:04:20","updated":"2019-01-15 14:04:20","title":"Metric Limits in Categories with a Flow","abstract":" In topological data science, categories with a flow have become ubiquitous,\nincluding as special cases examples like persistence modules and sheaves. With\nthe flow comes an interleaving distance, which has proven useful for\napplications. We give simple, categorical conditions which guarantee metric\ncompleteness of a category with a flow, meaning that every Cauchy sequence has\na limit. We also describe how to find a metric completion of a category with a\nflow by using its Yoneda embedding. The overarching goal of this work is to\nprepare the way for a theory of convergence of probability measures on these\ncategories with a flow.\n","authors":"Joshua Cruz","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.04828v1","link_pdf":"http://arxiv.org/pdf/1901.04828v1","link_doi":"","comment":"17 pages","journal_ref":"","doi":"","primary_category":"math.CT","categories":"math.CT|math.AT|18, 55"} {"id":"1901.04923v3","submitted":"2019-01-15 16:42:41","updated":"2019-06-05 12:30:40","title":"On (The Lack Of) Location Privacy in Crowdsourcing Applications","abstract":" Crowdsourcing enables application developers to benefit from large and\ndiverse datasets at a low cost. Specifically, mobile crowdsourcing (MCS)\nleverages users' devices as sensors to perform geo-located data collection. The\ncollection of geolocated data raises serious privacy concerns for users. Yet,\ndespite the large research body on location privacy-preserving mechanisms\n(LPPMs), MCS developers implement little to no protection for data collection\nor publication. To understand this mismatch, we study the performance of\nexisting LPPMs on publicly available data from two mobile crowdsourcing\nprojects. 
Our results show that well-established defenses are either not\napplicable or offer little protection in the MCS setting. Additionally, they\nhave a much stronger impact on applications' utility than foreseen in the\nliterature. This is because existing LPPMs, designed with location-based\nservices (LBSs) in mind, are optimized for utility functions based on users'\nlocations, while MCS utility functions depend on the values (e.g.,\nmeasurements) associated with those locations. We finally outline possible\nresearch avenues to facilitate the development of new location privacy\nsolutions that fit the needs of MCS so that the increasing number of such\napplications do not jeopardize their users' privacy.\n","authors":"Spyros Boukoros|Mathias Humbert|Stefan Katzenbeisser|Carmela Troncoso","affiliations":"TU-Darmstadt, Germany|Swiss Data Science Center, ETH Zurich and EPFL, Switzerland|TU-Darmstadt, Germany|EPFL, Switzerland","link_abstract":"http://arxiv.org/abs/1901.04923v3","link_pdf":"http://arxiv.org/pdf/1901.04923v3","link_doi":"","comment":"restructure and new title","journal_ref":"","doi":"","primary_category":"cs.CR","categories":"cs.CR"} {"id":"1901.05147v1","submitted":"2019-01-16 06:10:45","updated":"2019-01-16 06:10:45","title":"The Winning Solution to the IEEE CIG 2017 Game Data Mining Competition","abstract":" Machine learning competitions such as those organized by Kaggle or KDD\nrepresent a useful benchmark for data science research. In this work, we\npresent our winning solution to the Game Data Mining competition hosted at the\n2017 IEEE Conference on Computational Intelligence and Games (CIG 2017). The\ncontest consisted of two tracks, and participants (more than 250, belonging to\nboth industry and academia) were to predict which players would stop playing\nthe game, as well as their remaining lifetime. The data were provided by a\nmajor worldwide video game company, NCSoft, and came from their successful\nmassively multiplayer online game Blade and Soul. Here, we describe the long\nshort-term memory approach and conditional inference survival ensemble model\nthat made us win both tracks of the contest, as well as the validation\nprocedure that we followed in order to prevent overfitting. In particular,\nchoosing a survival method able to deal with censored data was crucial to\naccurately predict the moment in which each player would leave the game, as\ncensoring is inherent in churn. The selected models proved to be robust against\nevolving conditions---since there was a change in the business model of the\ngame (from subscription-based to free-to-play) between the two sample datasets\nprovided---and efficient in terms of time cost. 
Thanks to these features and\nalso to their a ability to scale to large datasets, our models could be readily\nimplemented in real business settings.\n","authors":"Anna Guitart|Pei Pei Chen|África Periáñez","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.05147v1","link_pdf":"http://arxiv.org/pdf/1901.05147v1","link_doi":"http://dx.doi.org/10.3390/make1010016","comment":"","journal_ref":"Machine Learning and Knowledge Extraction, 1(1), 252--264, 2019","doi":"10.3390/make1010016","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1901.05520v1","submitted":"2019-01-16 20:36:54","updated":"2019-01-16 20:36:54","title":"Trends in Demand, Growth, and Breadth in Scientific Computing Training\n Delivered by a High-Performance Computing Center","abstract":" We analyze the changes in the training and educational efforts of the SciNet\nHPC Consortium, a Canadian academic High Performance Computing center, in the\nareas of Scientific Computing and High-Performance Computing, over the last six\nyears. Initially, SciNet offered isolated training events on how to use HPC\nsystems and write parallel code, but the training program now consists of a\nbroad range of workshops and courses that users can take toward certificates in\nscientific computing, data science, or high-performance computing. Using data\non enrollment, attendence, and certificate numbers from SciNet's education\nwebsite, used by almost 1800 users so far, we extract trends on the growth,\ndemand, and breadth of SciNet's training program. Among the results are a\nsteady overall growth, a sharp and steady increase in the demand for data\nscience training, and a wider participation of 'non-traditional' computing\ndisciplines, which has motivated an increasingly broad spectrum of training\nofferings. Of interest is also that many of the training initiatives have\nevolved into courses that can be taken as part of the graduate curriculum at\nthe University of Toronto.\n","authors":"Ramses van Zon|Marcelo Ponce|Erik Spence|Daniel Gruner","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.05520v1","link_pdf":"http://arxiv.org/pdf/1901.05520v1","link_doi":"http://dx.doi.org/10.22369/issn.2153-4136/10/1/9","comment":"Presented at the Fifth Workshop on Best Practices for Enhancing HPC\n Training and Education (BPHTE18) @ SC18","journal_ref":"Journal of Computational Science Education Volume 10, Issue 1, pp.\n 53-60 (2019)","doi":"10.22369/issn.2153-4136/10/1/9","primary_category":"cs.CY","categories":"cs.CY|cs.DC"} {"id":"1901.06261v1","submitted":"2019-01-17 00:23:41","updated":"2019-01-17 00:23:41","title":"NeuNetS: An Automated Synthesis Engine for Neural Network Design","abstract":" Application of neural networks to a vast variety of practical applications is\ntransforming the way AI is applied in practice. Pre-trained neural network\nmodels available through APIs or capability to custom train pre-built neural\nnetwork architectures with customer data has made the consumption of AI by\ndevelopers much simpler and resulted in broad adoption of these complex AI\nmodels. While prebuilt network models exist for certain scenarios, to try and\nmeet the constraints that are unique to each application, AI teams need to\nthink about developing custom neural network architectures that can meet the\ntradeoff between accuracy and memory footprint to achieve the tight constraints\nof their unique use-cases. 
However, only a small proportion of data science\nteams have the skills and experience needed to create a neural network from\nscratch, and the demand far exceeds the supply. In this paper, we present\nNeuNetS : An automated Neural Network Synthesis engine for custom neural\nnetwork design that is available as part of IBM's AI OpenScale's product.\nNeuNetS is available for both Text and Image domains and can build neural\nnetworks for specific tasks in a fraction of the time it takes today with human\neffort, and with accuracy similar to that of human-designed AI models.\n","authors":"Atin Sood|Benjamin Elder|Benjamin Herta|Chao Xue|Costas Bekas|A. Cristiano I. Malossi|Debashish Saha|Florian Scheidegger|Ganesh Venkataraman|Gegi Thomas|Giovanni Mariani|Hendrik Strobelt|Horst Samulowitz|Martin Wistuba|Matteo Manica|Mihir Choudhury|Rong Yan|Roxana Istrate|Ruchir Puri|Tejaswini Pedapati","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.06261v1","link_pdf":"http://arxiv.org/pdf/1901.06261v1","link_doi":"","comment":"14 pages, 12 figures. arXiv admin note: text overlap with\n arXiv:1806.00250","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.SE|stat.ML"} {"id":"1901.06075v4","submitted":"2019-01-18 03:44:51","updated":"2019-07-08 18:34:54","title":"Splitting Methods for Convex Bi-Clustering and Co-Clustering","abstract":" Co-Clustering, the problem of simultaneously identifying clusters across\nmultiple aspects of a data set, is a natural generalization of clustering to\nhigher-order structured data. Recent convex formulations of bi-clustering and\ntensor co-clustering, which shrink estimated centroids together using a convex\nfusion penalty, allow for global optimality guarantees and precise theoretical\nanalysis, but their computational properties have been less well studied. In\nthis note, we present three efficient operator-splitting methods for the convex\nco-clustering problem: a standard two-block ADMM, a Generalized ADMM which\navoids an expensive tensor Sylvester equation in the primal update, and a\nthree-block ADMM based on the operator splitting scheme of Davis and Yin.\nTheoretical complexity analysis suggests, and experimental evidence confirms,\nthat the Generalized ADMM is far more efficient for large problems.\n","authors":"Michael Weylandt","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.06075v4","link_pdf":"http://arxiv.org/pdf/1901.06075v4","link_doi":"http://dx.doi.org/10.1109/DSW.2019.8755599","comment":"To appear in IEEE DSW 2019","journal_ref":"DSW 2019: Proceedings of the IEEE Data Science Workshop 2019, pp.\n 237-244","doi":"10.1109/DSW.2019.8755599","primary_category":"stat.ML","categories":"stat.ML|stat.CO"} {"id":"1901.06494v1","submitted":"2019-01-19 09:57:15","updated":"2019-01-19 09:57:15","title":"Writer Independent Offline Signature Recognition Using Ensemble Learning","abstract":" The area of Handwritten Signature Verification has been broadly researched in\nthe last decades, but remains an open research problem. In offline (static)\nsignature verification, the dynamic information of the signature writing\nprocess is lost, and it is difficult to design good feature extractors that can\ndistinguish genuine signatures and skilled forgeries. This verification task is\neven harder in writer independent scenarios which is undeniably fiscal for\nrealistic cases. In this paper, we have proposed an Ensemble model for offline\nwriter, independent signature verification task with Deep learning. 
We have\nused two CNNs for feature extraction, after that RGBT for classification &\nStacking to generate final prediction vector. We have done extensive\nexperiments on various datasets from various sources to maintain a variance in\nthe dataset. We have achieved the state of the art performance on various\ndatasets.\n","authors":"Sourya Dipta Das|Himanshu Ladia|Vaibhav Kumar|Shivansh Mishra","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.06494v1","link_pdf":"http://arxiv.org/pdf/1901.06494v1","link_doi":"","comment":"6 pages, 2 figures, International Conference on Data Science, Machine\n Learning & Applications (ICDSMLA)","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|cs.LG|stat.ML"} {"id":"1901.06551v1","submitted":"2019-01-19 16:36:49","updated":"2019-01-19 16:36:49","title":"Synthesizing facial photometries and corresponding geometries using\n generative adversarial networks","abstract":" Artificial data synthesis is currently a well studied topic with useful\napplications in data science, computer vision, graphics and many other fields.\nGenerating realistic data is especially challenging since human perception is\nhighly sensitive to non realistic appearance. In recent times, new levels of\nrealism have been achieved by advances in GAN training procedures and\narchitectures. These successful models, however, are tuned mostly for use with\nregularly sampled data such as images, audio and video. Despite the successful\napplication of the architecture on these types of media, applying the same\ntools to geometric data poses a far greater challenge. The study of geometric\ndeep learning is still a debated issue within the academic community as the\nlack of intrinsic parametrization inherent to geometric objects prohibits the\ndirect use of convolutional filters, a main building block of today's machine\nlearning systems. In this paper we propose a new method for generating\nrealistic human facial geometries coupled with overlayed textures. We\ncircumvent the parametrization issue by imposing a global mapping from our data\nto the unit rectangle. We further discuss how to design such a mapping to\ncontrol the mapping distortion and conserve area within the mapped image. By\nrepresenting geometric textures and geometries as images, we are able to use\nadvanced GAN methodologies to generate new geometries. We address the often\nneglected topic of relation between texture and geometry and propose to use\nthis correlation to match between generated textures and their corresponding\ngeometries. We offer a new method for training GAN models on partially\ncorrupted data. 
Finally, we provide empirical evidence demonstrating our\ngenerative model's ability to produce examples of new identities independent\nfrom the training data while maintaining a high level of realism, two traits\nthat are often at odds.\n","authors":"Gil Shamai|Ron Slossberg|Ron Kimmel","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.06551v1","link_pdf":"http://arxiv.org/pdf/1901.06551v1","link_doi":"","comment":"23 pages, 16 figures, 3 tables","journal_ref":"","doi":"","primary_category":"cs.CG","categories":"cs.CG|cs.CV"} {"id":"1901.07329v4","submitted":"2019-01-22 14:33:23","updated":"2020-02-26 14:21:28","title":"The autofeat Python Library for Automated Feature Engineering and\n Selection","abstract":" This paper describes the autofeat Python library, which provides scikit-learn\nstyle linear regression and classification models with automated feature\nengineering and selection capabilities. Complex non-linear machine learning\nmodels, such as neural networks, are in practice often difficult to train and\neven harder to explain to non-statisticians, who require transparent analysis\nresults as a basis for important business decisions. While linear models are\nefficient and intuitive, they generally provide lower prediction accuracies.\nOur library provides a multi-step feature engineering and selection process,\nwhere first a large pool of non-linear features is generated, from which then a\nsmall and robust set of meaningful features is selected, which improve the\nprediction accuracy of a linear model while retaining its interpretability.\n","authors":"Franziska Horn|Robert Pack|Michael Rieger","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.07329v4","link_pdf":"http://arxiv.org/pdf/1901.07329v4","link_doi":"","comment":"ECMLPKDD 2019 Workshop on Automating Data Science (ADS)","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1901.07852v3","submitted":"2019-01-23 12:42:22","updated":"2019-04-28 06:26:11","title":"Homomorphic Sensing","abstract":" A recent line of research termed unlabeled sensing and shuffled linear\nregression has been exploring under great generality the recovery of signals\nfrom subsampled and permuted measurements; a challenging problem in diverse\nfields of data science and machine learning. In this paper we introduce an\nabstraction of this problem which we call homomorphic sensing. Given a linear\nsubspace and a finite set of linear transformations we develop an algebraic\ntheory which establishes conditions guaranteeing that points in the subspace\nare uniquely determined from their homomorphic image under some transformation\nin the set. As a special case, we recover known conditions for unlabeled\nsensing, as well as new results and extensions. On the algorithmic level we\nexhibit two dynamic programming based algorithms, which to the best of our\nknowledge are the first working solutions for the unlabeled sensing problem for\nsmall dimensions. One of them, additionally based on branch-and-bound, when\napplied to image registration under affine transformations, performs on par\nwith or outperforms state-of-the-art methods on benchmark datasets.\n","authors":"Manolis C. 
Tsakiris|Liangzu Peng","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.07852v3","link_pdf":"http://arxiv.org/pdf/1901.07852v3","link_doi":"","comment":"","journal_ref":"Proceedings of the 36th International Conference on Machine\n Learning, PMLR 97:6335-6344, 2019","doi":"","primary_category":"cs.IT","categories":"cs.IT|math.IT"} {"id":"1901.08152v5","submitted":"2019-01-23 22:17:36","updated":"2019-11-12 22:26:48","title":"Veridical Data Science","abstract":" Building and expanding on principles of statistics, machine learning, and\nscientific inquiry, we propose the predictability, computability, and stability\n(PCS) framework for veridical data science. Our framework, comprised of both a\nworkflow and documentation, aims to provide responsible, reliable,\nreproducible, and transparent results across the entire data science life\ncycle. The PCS workflow uses predictability as a reality check and considers\nthe importance of computation in data collection/storage and algorithm design.\nIt augments predictability and computability with an overarching stability\nprinciple for the data science life cycle. Stability expands on statistical\nuncertainty considerations to assess how human judgment calls impact data\nresults through data and model/algorithm perturbations. Moreover, we develop\ninference procedures that build on PCS, namely PCS perturbation intervals and\nPCS hypothesis testing, to investigate the stability of data results relative\nto problem formulation, data cleaning, modeling decisions, and interpretations.\nWe illustrate PCS inference through neuroscience and genomics projects of our\nown and others and compare it to existing methods in high dimensional, sparse\nlinear model simulations. Over a wide range of misspecified simulation models,\nPCS inference demonstrates favorable performance in terms of ROC curves.\nFinally, we propose PCS documentation based on R Markdown or Jupyter Notebook,\nwith publicly available, reproducible codes and narratives to back up human\nchoices made throughout an analysis. The PCS workflow and documentation are\ndemonstrated in a genomics case study available on Zenodo.\n","authors":"Bin Yu|Karl Kumbier","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.08152v5","link_pdf":"http://arxiv.org/pdf/1901.08152v5","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1901.08605v1","submitted":"2019-01-24 19:01:12","updated":"2019-01-24 19:01:12","title":"Investing for Discovery and Sustainability in Astronomy in the 2020s","abstract":" As the next decade approaches, it is once again time for the US astronomical\ncommunity to assess its investment priorities for the coming decade on the\nground and in space. This report, created to aid NOAO in its planning for the\n2020 Decadal Survey on Astronomy and Astrophysics, reviews the outcome of the\nprevious Decadal Survey (Astro2010); describes the themes that emerged from the\n2018 NOAO community planning workshop \"NOAO Community Needs for Science in the\n2020s\"; and based on the above, offers thoughts for the coming review. 
We find\nthat a balanced set of investments in small- to large-scale initiatives is\nessential to a sustainable future, based on the experience of previous decades.\nWhile large facilities are the \"value\" investments that are guaranteed to\nproduce compelling science and discoveries, smaller facilities are the \"growth\nstocks\" that are likely to deliver the biggest science bang per buck, sometimes\nwith outsize returns. Investments in data-intensive missions also have benefits\nto society beyond the science they deliver. By training scientists who are well\nequipped to use their data science skills to solve problems in the public or\nprivate sector, astronomy can provide a valuable service to society by\ncontributing to a data-capable workforce.\n","authors":"Joan R. Najita","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.08605v1","link_pdf":"http://arxiv.org/pdf/1901.08605v1","link_doi":"","comment":"12 pages; see also http://ast.noao.edu/activities/decadal-survey","journal_ref":"","doi":"","primary_category":"astro-ph.IM","categories":"astro-ph.IM"} {"id":"1901.08705v1","submitted":"2019-01-25 01:10:55","updated":"2019-01-25 01:10:55","title":"Ambitious Data Science Can Be Painless","abstract":" Modern data science research can involve massive computational\nexperimentation; an ambitious PhD in computational fields may do experiments\nconsuming several million CPU hours. Traditional computing practices, in which\nresearchers use laptops or shared campus-resident resources, are inadequate for\nexperiments at the massive scale and varied scope that we now see in data\nscience. On the other hand, modern cloud computing promises seemingly unlimited\ncomputational resources that can be custom configured, and seems to offer a\npowerful new venue for ambitious data-driven science. Exploiting the cloud\nfully, the amount of work that could be completed in a fixed amount of time can\nexpand by several orders of magnitude.\n As potentially powerful as cloud-based experimentation may be in the\nabstract, it has not yet become a standard option for researchers in many\nacademic disciplines. The prospect of actually conducting massive computational\nexperiments in today's cloud systems confronts the potential user with daunting\nchallenges. Leading considerations include: (i) the seeming complexity of\ntoday's cloud computing interface, (ii) the difficulty of executing an\noverwhelmingly large number of jobs, and (iii) the difficulty of monitoring and\ncombining a massive collection of separate results. Starting a massive\nexperiment `bare-handed' seems therefore highly problematic and prone to rapid\n`researcher burn out'.\n New software stacks are emerging that render massive cloud experiments\nrelatively painless. Such stacks simplify experimentation by systematizing\nexperiment definition, automating distribution and management of tasks, and\nallowing easy harvesting of results and documentation. In this article, we\ndiscuss several painless computing stacks that abstract away the difficulties\nof massive experimentation, thereby allowing a proliferation of ambitious\nexperiments for scientific discovery.\n","authors":"Hatef Monajemi|Riccardo Murri|Eric Jonas|Percy Liang|Victoria Stodden|David L. 
Donoho","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.08705v1","link_pdf":"http://arxiv.org/pdf/1901.08705v1","link_doi":"","comment":"Submitted to Harvard Data Science Review","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC"} {"id":"1901.09548v3","submitted":"2019-01-28 08:42:39","updated":"2019-11-26 00:28:54","title":"CURE: Curvature Regularization For Missing Data Recovery","abstract":" Missing data recovery is an important and yet challenging problem in imaging\nand data science. Successful models often adopt certain carefully chosen\nregularization. Recently, the low dimension manifold model (LDMM) was\nintroduced by S.Osher et al. and shown effective in image inpainting. They\nobserved that enforcing low dimensionality on image patch manifold serves as a\ngood image regularizer. In this paper, we observe that having only the low\ndimension manifold regularization is not enough sometimes, and we need\nsmoothness as well. For that, we introduce a new regularization by combining\nthe low dimension manifold regularization with a higher order Curvature\nRegularization, and we call this new regularization CURE for short. The key\nstep of solving CURE is to solve a biharmonic equation on a manifold. We\nfurther introduce a weighted version of CURE, called WeCURE, in a similar\nmanner as the weighted nonlocal Laplacian (WNLL) method. Numerical experiments\nfor image inpainting and semi-supervised learning show that the proposed CURE\nand WeCURE significantly outperform LDMM and WNLL respectively.\n","authors":"Bin Dong|Haocheng Ju|Yiping Lu|Zuoqiang Shi","affiliations":"","link_abstract":"http://arxiv.org/abs/1901.09548v3","link_pdf":"http://arxiv.org/pdf/1901.09548v3","link_doi":"","comment":"17 pages, 7 figures, 4 tables","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|cs.NA|math.NA"} {"id":"1902.00197v3","submitted":"2019-02-01 06:28:38","updated":"2019-05-18 07:16:26","title":"Adaptive Monte Carlo Multiple Testing via Multi-Armed Bandits","abstract":" Monte Carlo (MC) permutation test is considered the gold standard for\nstatistical hypothesis testing, especially when standard parametric assumptions\nare not clear or likely to fail. However, in modern data science settings where\na large number of hypothesis tests need to be performed simultaneously, it is\nrarely used due to its prohibitive computational cost. In genome-wide\nassociation studies, for example, the number of hypothesis tests $m$ is around\n$10^6$ while the number of MC samples $n$ for each test could be greater than\n$10^8$, totaling more than $nm$=$10^{14}$ samples. In this paper, we propose\nAdaptive MC multiple Testing (AMT) to estimate MC p-values and control false\ndiscovery rate in multiple testing. The algorithm outputs the same result as\nthe standard full MC approach with high probability while requiring only\n$\\tilde{O}(\\sqrt{n}m)$ samples. This sample complexity is shown to be optimal.\nOn a Parkinson GWAS dataset, the algorithm reduces the running time from 2\nmonths for full MC to an hour. The AMT algorithm is derived based on the theory\nof multi-armed bandits.\n","authors":"Martin J. 
Zhang|James Zou|David Tse","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.00197v3","link_pdf":"http://arxiv.org/pdf/1902.00197v3","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ME","categories":"stat.ME|cs.IT|math.IT|q-bio.GN"} {"id":"1902.00300v3","submitted":"2019-02-01 12:19:37","updated":"2020-02-25 10:29:05","title":"MASER: A Science Ready Toolbox for Low Frequency Radio Astronomy","abstract":" MASER (Measurements, Analysis, and Simulation of Emission in the Radio range)\nis a comprehensive infrastructure dedicated to time-dependent low frequency\nradio astronomy (up to about 50 MHz). The main radio sources observed in this\nspectral range are the Sun, the magnetized planets (Earth, Jupiter, Saturn),\nand our Galaxy, which are observed either from ground or space. Ground\nobservatories can capture high resolution data streams with a high sensitivity.\nConversely, space-borne instruments can observe below the ionospheric cut-off\n(at about 10 MHz) and can be placed closer to the studied object. Several tools\nhave been developed in the last decade for sharing space physics data. Data\nvisualization tools developed by various institutes are available to share,\ndisplay and analyse space physics time series and spectrograms. The MASER team\nhas selected a sub-set of those tools and applied them to low frequency radio\nastronomy. MASER also includes a Python software library for reading raw data\nfrom agency archives.\n","authors":"Baptiste Cecconi|Alan Loh|Pierre Le Sidaner|Renaud Savalle|Xavier Bonnin|Quynh Nhu Nguyen|Sonny Lion|Albert Shih|Stéphane Aicardi|Philippe Zarka|Corentin Louis|Andrée Coffre|Laurent Lamy|Laurent Denis|Jean-Mathias Grießmeier|Jeremy Faden|Chris Piker|Nicolas André|Vincent Génot|Stéphane Erard|Joseph N Mafi|Todd A King|Jim Sky|Markus Demleitner","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.00300v3","link_pdf":"http://arxiv.org/pdf/1902.00300v3","link_doi":"http://dx.doi.org/10.5334/dsj-2020-012","comment":"Submitted to Data Science Journal special issue after PV2018\n conference","journal_ref":"","doi":"10.5334/dsj-2020-012","primary_category":"astro-ph.IM","categories":"astro-ph.IM|astro-ph.EP"} {"id":"1902.00522v1","submitted":"2019-02-01 19:02:18","updated":"2019-02-01 19:02:18","title":"Deep Learning for Multi-Messenger Astrophysics: A Gateway for Discovery\n in the Big Data Era","abstract":" This report provides an overview of recent work that harnesses the Big Data\nRevolution and Large Scale Computing to address grand computational challenges\nin Multi-Messenger Astrophysics, with a particular emphasis on real-time\ndiscovery campaigns. Acknowledging the transdisciplinary nature of\nMulti-Messenger Astrophysics, this document has been prepared by members of the\nphysics, astronomy, computer science, data science, software and\ncyberinfrastructure communities who attended the NSF-, DOE- and NVIDIA-funded\n\"Deep Learning for Multi-Messenger Astrophysics: Real-time Discovery at Scale\"\nworkshop, hosted at the National Center for Supercomputing Applications,\nOctober 17-19, 2018. Highlights of this report include unanimous agreement that\nit is critical to accelerate the development and deployment of novel,\nsignal-processing algorithms that use the synergy between artificial\nintelligence (AI) and high performance computing to maximize the potential for\nscientific discovery with Multi-Messenger Astrophysics. 
We discuss key aspects\nto realize this endeavor, namely (i) the design and exploitation of scalable\nand computationally efficient AI algorithms for Multi-Messenger Astrophysics;\n(ii) cyberinfrastructure requirements to numerically simulate astrophysical\nsources, and to process and interpret Multi-Messenger Astrophysics data; (iii)\nmanagement of gravitational wave detections and triggers to enable\nelectromagnetic and astro-particle follow-ups; (iv) a vision to harness future\ndevelopments of machine and deep learning and cyberinfrastructure resources to\ncope with the scale of discovery in the Big Data Era; (v) and the need to build\na community that brings domain experts together with data scientists on equal\nfooting to maximize and accelerate discovery in the nascent field of\nMulti-Messenger Astrophysics.\n","authors":"Gabrielle Allen|Igor Andreoni|Etienne Bachelet|G. Bruce Berriman|Federica B. Bianco|Rahul Biswas|Matias Carrasco Kind|Kyle Chard|Minsik Cho|Philip S. Cowperthwaite|Zachariah B. Etienne|Daniel George|Tom Gibbs|Matthew Graham|William Gropp|Anushri Gupta|Roland Haas|E. A. Huerta|Elise Jennings|Daniel S. Katz|Asad Khan|Volodymyr Kindratenko|William T. C. Kramer|Xin Liu|Ashish Mahabal|Kenton McHenry|J. M. Miller|M. S. Neubauer|Steve Oberlin|Alexander R. Olivas Jr|Shawn Rosofsky|Milton Ruiz|Aaron Saxton|Bernard Schutz|Alex Schwing|Ed Seidel|Stuart L. Shapiro|Hongyu Shen|Yue Shen|Brigitta M. Sipőcz|Lunan Sun|John Towns|Antonios Tsokaros|Wei Wei|Jack Wells|Timothy J. Williams|Jinjun Xiong|Zhizhen Zhao","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.00522v1","link_pdf":"http://arxiv.org/pdf/1902.00522v1","link_doi":"","comment":"15 pages, no figures. White paper based on the \"Deep Learning for\n Multi-Messenger Astrophysics: Real-time Discovery at Scale\" workshop, hosted\n at NCSA, October 17-19, 2018\n http://www.ncsa.illinois.edu/Conferences/DeepLearningLSST/","journal_ref":"","doi":"","primary_category":"astro-ph.IM","categories":"astro-ph.IM|astro-ph.HE|cs.LG|gr-qc"} {"id":"1902.00562v1","submitted":"2019-02-01 20:58:34","updated":"2019-02-01 20:58:34","title":"The Spatially-Conscious Machine Learning Model","abstract":" Successfully predicting gentrification could have many social and commercial\napplications; however, real estate sales are difficult to predict because they\nbelong to a chaotic system comprised of intrinsic and extrinsic\ncharacteristics, perceived value, and market speculation. Using New York City\nreal estate as our subject, we combine modern techniques of data science and\nmachine learning with traditional spatial analysis to create robust real estate\nprediction models for both classification and regression tasks. We compare\nseveral cutting edge machine learning algorithms across spatial, semi-spatial\nand non-spatial feature engineering techniques, and we empirically show that\nspatially-conscious machine learning models outperform non-spatial models when\nmarried with advanced prediction techniques such as feed-forward artificial\nneural networks and gradient boosting machine models.\n","authors":"Timothy J. Kiely|Nathaniel D. 
Bastian","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.00562v1","link_pdf":"http://arxiv.org/pdf/1902.00562v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG"} {"id":"1902.00635v2","submitted":"2019-02-02 03:28:32","updated":"2019-09-04 04:56:15","title":"Uniform-in-Time Weak Error Analysis for Stochastic Gradient Descent\n Algorithms via Diffusion Approximation","abstract":" Diffusion approximation provides weak approximation for stochastic gradient\ndescent algorithms in a finite time horizon. In this paper, we introduce new\ntools motivated by the backward error analysis of numerical stochastic\ndifferential equations into the theoretical framework of diffusion\napproximation, extending the validity of the weak approximation from finite to\ninfinite time horizon. The new techniques developed in this paper enable us to\ncharacterize the asymptotic behavior of constant-step-size SGD algorithms for\nstrongly convex objective functions, a goal previously unreachable within the\ndiffusion approximation framework. Our analysis builds upon a truncated formal\npower expansion of the solution of a stochastic modified equation arising from\ndiffusion approximation, where the main technical ingredient is a\nuniform-in-time weak error bound controlling the long-term behavior of the\nexpansion coefficient functions near the global minimum. We expect these new\ntechniques to greatly expand the range of applicability of diffusion\napproximation to cover wider and deeper aspects of stochastic optimization\nalgorithms in data science.\n","authors":"Yuanyuan Feng|Tingran Gao|Lei Li|Jian-Guo Liu|Yulong Lu","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.00635v2","link_pdf":"http://arxiv.org/pdf/1902.00635v2","link_doi":"","comment":"22 pages, 3 figures. To appear in Comm. Math. Sci. (2019+)","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|math.OC|stat.ML|60J20, 90C15|G.1.6"} {"id":"1902.01194v4","submitted":"2019-02-04 14:12:30","updated":"2019-09-16 11:32:25","title":"Deep One-Class Classification Using Intra-Class Splitting","abstract":" This paper introduces a generic method which enables to use conventional deep\nneural networks as end-to-end one-class classifiers. The method is based on\nsplitting given data from one class into two subsets. In one-class\nclassification, only samples of one normal class are available for training.\nDuring inference, a closed and tight decision boundary around the training\nsamples is sought which conventional binary or multi-class neural networks are\nnot able to provide. By splitting data into typical and atypical normal\nsubsets, the proposed method can use a binary loss and defines an auxiliary\nsubnetwork for distance constraints in the latent space. 
Various experiments on\nthree well-known image datasets showed the effectiveness of the proposed method\nwhich outperformed seven baselines and had a better or comparable performance\nto the state-of-the-art.\n","authors":"Patrick Schlachter|Yiwen Liao|Bin Yang","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.01194v4","link_pdf":"http://arxiv.org/pdf/1902.01194v4","link_doi":"http://dx.doi.org/10.1109/DSW.2019.8755576","comment":"IEEE Data Science Workshop 2019 (DSW 2019)","journal_ref":"","doi":"10.1109/DSW.2019.8755576","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1902.01304v1","submitted":"2019-02-04 16:52:40","updated":"2019-02-04 16:52:40","title":"Declarative Data Analytics: a Survey","abstract":" The area of declarative data analytics explores the application of the\ndeclarative paradigm on data science and machine learning. It proposes\ndeclarative languages for expressing data analysis tasks and develops systems\nwhich optimize programs written in those languages. The execution engine can be\neither centralized or distributed, as the declarative paradigm advocates\nindependence from particular physical implementations. The survey explores a\nwide range of declarative data analysis frameworks by examining both the\nprogramming model and the optimization techniques used, in order to provide\nconclusions on the current state of the art in the area and identify open\nchallenges.\n","authors":"Nantia Makrynioti|Vasilis Vassalos","affiliations":"Athens University of Economics and Business|Athens University of Economics and Business","link_abstract":"http://arxiv.org/abs/1902.01304v1","link_pdf":"http://arxiv.org/pdf/1902.01304v1","link_doi":"","comment":"36 pages, 2 figures","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB|cs.LG"} {"id":"1902.01580v1","submitted":"2019-02-05 08:09:33","updated":"2019-02-05 08:09:33","title":"PUTWorkbench: Analysing Privacy in AI-intensive Systems","abstract":" AI intensive systems that operate upon user data face the challenge of\nbalancing data utility with privacy concerns. We propose the idea and present\nthe prototype of an open-source tool called Privacy Utility Trade-off (PUT)\nWorkbench which seeks to aid software practitioners to take such crucial\ndecisions. We pick a simple privacy model that doesn't require any background\nknowledge in Data Science and show how even that can achieve significant\nresults over standard and real-life datasets. The tool and the source code is\nmade freely available for extensions and usage.\n","authors":"Saurabh Srivastava|Vinay P. Namboodiri|T. V. Prabhakar","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.01580v1","link_pdf":"http://arxiv.org/pdf/1902.01580v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CR","categories":"cs.CR|cs.AI|cs.SE"} {"id":"1902.02376v1","submitted":"2019-02-06 19:42:14","updated":"2019-02-06 19:42:14","title":"DiffEqFlux.jl - A Julia Library for Neural Differential Equations","abstract":" DiffEqFlux.jl is a library for fusing neural networks and differential\nequations. In this work we describe differential equations from the viewpoint\nof data science and discuss the complementary nature between machine learning\nmodels and differential equations. We demonstrate the ability to incorporate\nDifferentialEquations.jl-defined differential equation problems into a\nFlux-defined neural network, and vice versa. 
The advantages of being able to\nuse the entire DifferentialEquations.jl suite for this purpose is demonstrated\nby counter examples where simple integration strategies fail, but the\nsophisticated integration strategies provided by the DifferentialEquations.jl\nlibrary succeed. This is followed by a demonstration of delay differential\nequations and stochastic differential equations inside of neural networks. We\nshow high-level functionality for defining neural ordinary differential\nequations (neural networks embedded into the differential equation) and\ndescribe the extra models in the Flux model zoo which includes neural\nstochastic differential equations. We conclude by discussing the various\nadjoint methods used for backpropogation of the differential equation solvers.\nDiffEqFlux.jl is an important contribution to the area, as it allows the full\nweight of the differential equation solvers developed from decades of research\nin the scientific computing field to be readily applied to the challenges posed\nby machine learning and data science.\n","authors":"Chris Rackauckas|Mike Innes|Yingbo Ma|Jesse Bettencourt|Lyndon White|Vaibhav Dixit","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.02376v1","link_pdf":"http://arxiv.org/pdf/1902.02376v1","link_doi":"","comment":"Julialang Blog post, DiffEqFlux.jl","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1902.02808v1","submitted":"2019-02-07 19:15:51","updated":"2019-02-07 19:15:51","title":"ML Health: Fitness Tracking for Production Models","abstract":" Deployment of machine learning (ML) algorithms in production for extended\nperiods of time has uncovered new challenges such as monitoring and management\nof real-time prediction quality of a model in the absence of labels. However,\nsuch tracking is imperative to prevent catastrophic business outcomes resulting\nfrom incorrect predictions. The scale of these deployments makes manual\nmonitoring prohibitive, making automated techniques to track and raise alerts\nimperative. We present a framework, ML Health, for tracking potential drops in\nthe predictive performance of ML models in the absence of labels. The framework\nemploys diagnostic methods to generate alerts for further investigation. We\ndevelop one such method to monitor potential problems when production data\npatterns do not match training data distributions. 
We demonstrate that our\nmethod performs better than standard \"distance metrics\", such as RMSE,\nKL-Divergence, and Wasserstein at detecting issues with mismatched data sets.\nFinally, we present a working system that incorporates the ML Health approach\nto monitor and manage ML deployments within a realistic full production ML\nlifecycle.\n","authors":"Sindhu Ghanta|Sriram Subramanian|Lior Khermosh|Swaminathan Sundararaman|Harshil Shah|Yakov Goldberg|Drew Roselli|Nisha Talagala","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.02808v1","link_pdf":"http://arxiv.org/pdf/1902.02808v1","link_doi":"","comment":"This paper has been submitted to the Data Science track of KDD 2019","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1902.03233v3","submitted":"2019-02-08 18:53:27","updated":"2020-01-21 02:27:50","title":"A 3D Probabilistic Deep Learning System for Detection and Diagnosis of\n Lung Cancer Using Low-Dose CT Scans","abstract":" We introduce a new computer aided detection and diagnosis system for lung\ncancer screening with low-dose CT scans that produces meaningful probability\nassessments. Our system is based entirely on 3D convolutional neural networks\nand achieves state-of-the-art performance for both lung nodule detection and\nmalignancy classification tasks on the publicly available LUNA16 and Kaggle\nData Science Bowl challenges. While nodule detection systems are typically\ndesigned and optimized on their own, we find that it is important to consider\nthe coupling between detection and diagnosis components. Exploiting this\ncoupling allows us to develop an end-to-end system that has higher and more\nrobust performance and eliminates the need for a nodule detection false\npositive reduction stage. Furthermore, we characterize model uncertainty in our\ndeep learning systems, a first for lung CT analysis, and show that we can use\nthis to provide well-calibrated classification probabilities for both nodule\ndetection and patient malignancy diagnosis. These calibrated probabilities\ninformed by model uncertainty can be used for subsequent risk-based decision\nmaking towards diagnostic interventions or disease treatments, as we\ndemonstrate using a probability-based patient referral strategy to further\nimprove our results.\n","authors":"Onur Ozdemir|Rebecca L. Russell|Andrew A. Berlin","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.03233v3","link_pdf":"http://arxiv.org/pdf/1902.03233v3","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|cs.LG"} {"id":"1902.03721v1","submitted":"2019-02-11 04:14:53","updated":"2019-02-11 04:14:53","title":"High-Throughput Computational Studies in Catalysis and Materials\n Research, and their Impact on Rational Design","abstract":" In the 21st century, many technology fields have become reliant on\nadvancements in process automation. We have seen dramatic growth in areas and\nindustries that have successfully implemented a high level of automation. In\ndrug discovery, for example, it has alleviated an otherwise extremely complex\nand tedious process and has resulted in the development of several new drugs.\nOver the last decade, these automation techniques have begun being adapted in\nthe chemical and materials community as well with the goal of exploring\nchemical space and pursuing the discovery and design of novel compounds for\nvarious applications. 
The impact of new materials on industrial and economic\ndevelopment has been stimulating tremendous research efforts by the materials\ncommunity, and embracing automation as well as tools from computational and\ndata science have led to acceleration and streamlining of the discovery\nprocess. In particular, virtual high-throughput screening (HTPS) is now\nbecoming a mainstream technique to search for materials with properties that\nare tailored for specific applications. Its efficiency combined with the\nincreasing availability of open-source codes and large computational resources\nmakes it a powerful and attractive tool in materials research. Herein, we will\nreview a selection of recent, high-profile HTPS projects for new materials and\ncatalysts. In the case of catalysts, we focus on the HTPS studies for oxygen\nreduction reaction, oxygen evolution reaction, hydrogen evolution reaction, and\ncarbon dioxide reduction reaction. Whereas, for other materials applications,\nwe emphasize on the HTPS studies for photovoltaics, gas separation,\nhigh-refractive-index materials, and OLEDs.\n","authors":"Mohammad Atif Faiz Afzal|Johannes Hachmann","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.03721v1","link_pdf":"http://arxiv.org/pdf/1902.03721v1","link_doi":"","comment":"This is a review article. It contains 34 pages, 13 figures, and 211\n references","journal_ref":"","doi":"","primary_category":"physics.comp-ph","categories":"physics.comp-ph|cond-mat.mtrl-sci"} {"id":"1902.03825v4","submitted":"2019-02-11 11:31:38","updated":"2019-02-23 17:21:11","title":"Prediction of Malignant & Benign Breast Cancer: A Data Mining Approach\n in Healthcare Applications","abstract":" As much as data science is playing a pivotal role everywhere, healthcare also\nfinds it prominent application. Breast Cancer is the top rated type of cancer\namongst women; which took away 627,000 lives alone. This high mortality rate\ndue to breast cancer does need attention, for early detection so that\nprevention can be done in time. As a potential contributor to state-of-art\ntechnology development, data mining finds a multi-fold application in\npredicting Brest cancer. This work focuses on different classification\ntechniques implementation for data mining in predicting malignant and benign\nbreast cancer. Breast Cancer Wisconsin data set from the UCI repository has\nbeen used as experimental dataset while attribute clump thickness being used as\nan evaluation class. The performances of these twelve algorithms: Ada Boost M\n1, Decision Table, J Rip, Lazy IBK, Logistics Regression, Multiclass\nClassifier, Multilayer Perceptron, Naive Bayes, Random forest and Random Tree\nare analyzed on this data set. Keywords- Data Mining, Classification\nTechniques, UCI repository, Breast Cancer, Classification Algorithms\n","authors":"Vivek Kumar|Brojo Kishore Mishra|Manuel Mazzara|Dang N. H. Thanh|Abhishek Verma","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.03825v4","link_pdf":"http://arxiv.org/pdf/1902.03825v4","link_doi":"","comment":"8 Pages, 2 Figures, 4 Tables. 
Conference- Advances in Data Science\n and Management - Proceedings of ICDSM 2019 To be published with- Springer,\n Lecture Notes on Data Engineering and Communications Technologies series","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.CY|stat.ML"} {"id":"1902.06002v2","submitted":"2019-02-15 23:04:24","updated":"2019-09-30 22:43:13","title":"Group testing: an information theory perspective","abstract":" The group testing problem concerns discovering a small number of defective\nitems within a large population by performing tests on pools of items. A test\nis positive if the pool contains at least one defective, and negative if it\ncontains no defectives. This is a sparse inference problem with a combinatorial\nflavour, with applications in medical testing, biology, telecommunications,\ninformation technology, data science, and more. In this monograph, we survey\nrecent developments in the group testing problem from an information-theoretic\nperspective. We cover several related developments: efficient algorithms with\npractical storage and computation requirements, achievability bounds for\noptimal decoding methods, and algorithm-independent converse bounds. We assess\nthe theoretical guarantees not only in terms of scaling laws, but also in terms\nof the constant factors, leading to the notion of the \"rate\" of group testing,\nindicating the amount of information learned per test. Considering both\nnoiseless and noisy settings, we identify several regimes where existing\nalgorithms are provably optimal or near-optimal, as well as regimes where there\nremains greater potential for improvement. In addition, we survey results\nconcerning a number of variations on the standard group testing problem,\nincluding partial recovery criteria, adaptive algorithms with a limited number\nof stages, constrained test designs, and sublinear-time algorithms.\n","authors":"Matthew Aldridge|Oliver Johnson|Jonathan Scarlett","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.06002v2","link_pdf":"http://arxiv.org/pdf/1902.06002v2","link_doi":"http://dx.doi.org/10.1561/0100000099","comment":"Survey paper, 140 pages, 19 figures. To be published in Foundations\n and Trends in Communications and Information Theory","journal_ref":"Foundations and Trends in Communications and Information Theory:\n Vol. 15: No. 3-4, pp 196-392, 2019","doi":"10.1561/0100000099","primary_category":"cs.IT","categories":"cs.IT|cs.DM|math.IT|math.PR|math.ST|stat.TH"} {"id":"1902.06804v1","submitted":"2019-02-18 21:22:45","updated":"2019-02-18 21:22:45","title":"Democratisation of Usable Machine Learning in Computer Vision","abstract":" Many industries are now investing heavily in data science and automation to\nreplace manual tasks and/or to help with decision making, especially in the\nrealm of leveraging computer vision to automate many monitoring, inspection,\nand surveillance tasks. This has resulted in the emergence of the 'data\nscientist' who is conversant in statistical thinking, machine learning (ML),\ncomputer vision, and computer programming. However, as ML becomes more\naccessible to the general public and more aspects of ML become automated,\napplications leveraging computer vision are increasingly being created by\nnon-experts with less opportunity for regulatory oversight. This points to the\noverall need for more educated responsibility for these lay-users of usable ML\ntools in order to mitigate potentially unethical ramifications. 
In this paper,\nwe undertake a SWOT analysis to study the strengths, weaknesses, opportunities,\nand threats of building usable ML tools for mass adoption for important areas\nleveraging ML such as computer vision. The paper proposes a set of data science\nliteracy criteria for educating and supporting lay-users in the responsible\ndevelopment and deployment of ML applications.\n","authors":"Raymond Bond|Ansgar Koene|Alan Dix|Jennifer Boger|Maurice D. Mulvenna|Mykola Galushka|Bethany Waterhouse Bradley|Fiona Browne|Hui Wang|Alexander Wong","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.06804v1","link_pdf":"http://arxiv.org/pdf/1902.06804v1","link_doi":"","comment":"4 pages","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|cs.AI|cs.LG"} {"id":"1902.07958v1","submitted":"2019-02-21 10:50:44","updated":"2019-02-21 10:50:44","title":"Deep Learning Multidimensional Projections","abstract":" Dimensionality reduction methods, also known as projections, are frequently\nused for exploring multidimensional data in machine learning, data science, and\ninformation visualization. Among these, t-SNE and its variants have become very\npopular for their ability to visually separate distinct data clusters. However,\nsuch methods are computationally expensive for large datasets, suffer from\nstability problems, and cannot directly handle out-of-sample data. We propose a\nlearning approach to construct such projections. We train a deep neural network\nbased on a collection of samples from a given data universe, and their\ncorresponding projections, and next use the network to infer projections of\ndata from the same, or similar, universes. Our approach generates projections\nwith similar characteristics as the learned ones, is computationally two to\nthree orders of magnitude faster than SNE-class methods, has no complex-to-set\nuser parameters, handles out-of-sample data in a stable manner, and can be used\nto learn any projection technique. We demonstrate our proposal on several\nreal-world high dimensional datasets from machine learning.\n","authors":"Mateus Espadoto|Nina S. T. Hirata|Alexandru C. Telea","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.07958v1","link_pdf":"http://arxiv.org/pdf/1902.07958v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1903.00405v1","submitted":"2019-02-21 14:42:52","updated":"2019-02-21 14:42:52","title":"Quantifying contribution and propagation of error from computational\n steps, algorithms and hyperparameter choices in image classification\n pipelines","abstract":" Data science relies on pipelines that are organized in the form of\ninterdependent computational steps. Each step consists of various candidate\nalgorithms that maybe used for performing a particular function. Each algorithm\nconsists of several hyperparameters. Algorithms and hyperparameters must be\noptimized as a whole to produce the best performance. Typical machine learning\npipelines consist of complex algorithms in each of the steps. Not only is the\nselection process combinatorial, but it is also important to interpret and\nunderstand the pipelines. We propose a method to quantify the importance of\ndifferent components in the pipeline, by computing an error contribution\nrelative to an agnostic choice of computational steps, algorithms and\nhyperparameters. 
We also propose a methodology to quantify the propagation of\nerror from individual components of the pipeline with the help of a naive set\nof benchmark algorithms not involved in the pipeline. We demonstrate our\nmethodology on image classification pipelines. The agnostic and naive\nmethodologies quantify the error contribution and propagation respectively from\nthe computational steps, algorithms and hyperparameters in the image\nclassification pipeline. We show that algorithm selection and hyperparameter\noptimization methods like grid search, random search and Bayesian optimization\ncan be used to quantify the error contribution and propagation, and that random\nsearch is able to quantify them more accurately than Bayesian optimization.\nThis methodology can be used by domain experts to understand machine learning\nand data analysis pipelines in terms of their individual components, which can\nhelp in prioritizing different components of the pipeline.\n","authors":"Aritra Chowdhury|Malik Magdon-Ismail|Bulent Yener","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.00405v1","link_pdf":"http://arxiv.org/pdf/1903.00405v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|cs.LG|stat.ML"} {"id":"1902.08319v1","submitted":"2019-02-22 00:35:55","updated":"2019-02-22 00:35:55","title":"Multi-marginal Schrodinger bridges","abstract":" We consider the problem to identify the most likely flow in phase space, of\n(inertial) particles under stochastic forcing, that is in agreement with\nspatial (marginal) distributions that are specified at a set of points in time.\nThe question raised generalizes the classical Schrodinger Bridge Problem (SBP)\nwhich seeks to interpolate two specified end-point marginal distributions of\noverdamped particles driven by stochastic excitation. While we restrict our\nanalysis to second-order dynamics for the particles, the data represents\npartial (i.e., only positional) information on the flow at {\\em multiple}\ntime-points. The solution sought, as in SBP, represents a probability law on\nthe space of paths this closest to a uniform prior while consistent with the\ngiven marginals. We approach this problem as an optimal control problem to\nminimize an action integral a la Benamou-Brenier, and derive a time-symmetric\nformulation that includes a Fisher information term on the velocity field. We\nunderscore the relation of our problem to recent measure-valued splines in\nWasserstein space, which is akin to that between SBP and Optimal Mass Transport\n(OMT). The connection between the two provides a Sinkhorn-like approach to\ncomputing measure-valued splines. We envision that interpolation between\nmeasures as sought herein will have a wide range of applications in\nsignal/images processing as well as in data science in cases where data have a\ntemporal dimension.\n","authors":"Yongxin Chen|Giovanni Conforti|Tryphon T. Georgiou|Luigia Ripani","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.08319v1","link_pdf":"http://arxiv.org/pdf/1902.08319v1","link_doi":"","comment":"8 pages","journal_ref":"","doi":"","primary_category":"math.OC","categories":"math.OC|math.PR|93E20, 58J65, 46Nxx"} {"id":"1902.08638v1","submitted":"2019-02-22 19:16:32","updated":"2019-02-22 19:16:32","title":"MPP: Model Performance Predictor","abstract":" Operations is a key challenge in the domain of machine learning pipeline\ndeployments involving monitoring and management of real-time prediction\nquality. 
Typically, metrics like accuracy, RMSE etc., are used to track the\nperformance of models in deployment. However, these metrics cannot be\ncalculated in production due to the absence of labels. We propose using an ML\nalgorithm, Model Performance Predictor (MPP), to track the performance of the\nmodels in deployment. We argue that an ensemble of such metrics can be used to\ncreate a score representing the prediction quality in production. This in turn\nfacilitates formulation and customization of ML alerts, that can be escalated\nby an operations team to the data science team. Such a score automates\nmonitoring and enables ML deployments at scale.\n","authors":"Sindhu Ghanta|Sriram Subramanian|Lior Khermosh|Harshil Shah|Yakov Goldberg|Swaminathan Sundararaman|Drew Roselli|Nisha Talagala","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.08638v1","link_pdf":"http://arxiv.org/pdf/1902.08638v1","link_doi":"","comment":"submitted to OpML 2019","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1902.08681v1","submitted":"2019-02-22 21:49:22","updated":"2019-02-22 21:49:22","title":"Influencing factors that determine the usage of the crowd-shipping\n services","abstract":" The objective of this study is to understand how senders choose shipping\nservices for different products, given the availability of both emerging\ncrowd-shipping (CS) and traditional carriers in a logistics market. Using data\ncollected from a US survey, Random Utility Maximization (RUM) and Random Regret\nMinimization (RRM) models have been employed to reveal factors that influence\nthe diversity of decisions made by senders. Shipping costs, along with\nadditional real-time services such as courier reputations, tracking info,\ne-notifications, and customized delivery time and location, have been found to\nhave remarkable impacts on senders' choices. Interestingly, potential senders\nwere willing to pay more to ship grocery items such as food, beverages, and\nmedicines by CS services. Moreover, the real-time services have low\nelasticities, meaning that only a slight change in those services will lead to\na change in sender-behavior. Finally, data-science techniques were used to\nassess the performance of the RUM and RRM models and found to have similar\naccuracies. The findings from this research will help logistics firms address\npotential market segments, prepare service configurations to fulfill senders'\nexpectations, and develop effective business operations strategies.\n","authors":"Tho V. Le|Satish V. Ukkusuri","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.08681v1","link_pdf":"http://arxiv.org/pdf/1902.08681v1","link_doi":"","comment":"32 pages","journal_ref":"TRR, 2019","doi":"","primary_category":"econ.GN","categories":"econ.GN|q-fin.EC"} {"id":"1902.09302v5","submitted":"2019-02-25 14:43:57","updated":"2019-12-13 19:18:22","title":"Configuration Models of Random Hypergraphs","abstract":" Many empirical networks are intrinsically polyadic, with interactions\noccurring within groups of agents of arbitrary size. There are, however, few\nflexible null models that can support statistical inference for such polyadic\nnetworks. We define a class of null random hypergraphs that hold constant both\nthe node degree and edge dimension sequences, generalizing the classical dyadic\nconfiguration model. 
We provide a Markov Chain Monte Carlo scheme for sampling\nfrom these models, and discuss connections and distinctions between our\nproposed models and previous approaches. We then illustrate these models\nthrough a triplet of applications. We start with two classical network topics\n-- triadic clustering and degree-assortativity. In each, we emphasize the\nimportance of randomizing over hypergraph space rather than projected graph\nspace, showing that this choice can dramatically alter statistical inference\nand study findings. We then define and study the edge intersection profile of a\nhypergraph as a measure of higher-order correlation between edges, and derive\nasymptotic approximations under the stub-labeled null. Our experiments\nemphasize the ability of explicit, statistically-grounded polyadic modeling to\nsignificantly enhance the toolbox of network data science. We close with\nsuggestions for multiple avenues of future work.\n","authors":"Philip S. Chodrow","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.09302v5","link_pdf":"http://arxiv.org/pdf/1902.09302v5","link_doi":"","comment":"Major revisions to all text and figures","journal_ref":"","doi":"","primary_category":"math.PR","categories":"math.PR|cs.SI|physics.data-an|physics.soc-ph|stat.AP"} {"id":"1903.02521v1","submitted":"2019-02-25 19:16:58","updated":"2019-02-25 19:16:58","title":"Quantifying error contributions of computational steps, algorithms and\n hyperparameter choices in image classification pipelines","abstract":" Data science relies on pipelines that are organized in the form of\ninterdependent computational steps. Each step consists of various candidate\nalgorithms that may be used to perform a particular function. Each algorithm\nconsists of several hyperparameters. Algorithms and hyperparameters must be\noptimized as a whole to produce the best performance. Typical machine learning\npipelines consist of complex algorithms in each of the steps. Not\nonly is the selection process combinatorial, but it is also important to\ninterpret and understand the pipelines. We propose a method to quantify the\nimportance of different layers in the pipeline, by computing an error\ncontribution relative to an agnostic choice of algorithms in that layer. We\ndemonstrate our methodology on image classification pipelines. The agnostic\nmethodology quantifies the error contributions from the computational steps,\nalgorithms and hyperparameters in the image classification pipeline. We show\nthat algorithm selection and hyperparameter optimization methods can be used\nto quantify the error contribution and that random search is able to quantify\nthe contribution more accurately than Bayesian optimization. 
This methodology\ncan be used by domain experts to understand machine learning and data analysis\npipelines in terms of their individual components, which can help in\nprioritizing different components of the pipeline.\n","authors":"Aritra Chowdhury|Malik Magdon-Ismail|Bulent Yener","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.02521v1","link_pdf":"http://arxiv.org/pdf/1903.02521v1","link_doi":"","comment":"arXiv admin note: substantial text overlap with arXiv:1903.00405","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV|cs.LG|stat.ML"} {"id":"1902.09584v1","submitted":"2019-02-25 19:41:15","updated":"2019-02-25 19:41:15","title":"Beyond Frequency: Utility Mining with Varied Item-Specific Minimum\n Utility","abstract":" Utility-oriented mining, which integrates utility theory and data mining, is a\nuseful tool for understanding economic consumer behavior. Traditional\nalgorithms for mining high-utility patterns (HUPs) apply a single/uniform\nminimum high-utility threshold (minutil) to obtain the set of HUPs, but in some\nreal-life circumstances, some specific products may bring lower utilities\ncompared with others, yet their profit may offer some vital information.\nHowever, if minutil is set high, the patterns with low minutil are missed; if\nminutil is set low, the number of patterns becomes unmanageable. In this paper,\nan efficient one-phase utility-oriented pattern mining algorithm, called HIMU,\nis proposed for mining HUPs with varied item-specific minimum utility. A novel\ntree structure called a multiple item utility set-enumeration tree (MIU-tree),\nthe global sorted and the conditional downward closure properties are\nintroduced in HIMU. In addition, we extended the compact utility-list structure\nto keep the necessary information, and thus this one-phase HIMU model greatly\nreduces the computational costs and memory requirements. Moreover, two pruning\nstrategies are then extended to enhance the performance. We conducted extensive\nexperiments on several synthetic and real-world datasets; the results indicate\nthat the designed one-phase HIMU algorithm can address the \"rare item problem\"\nand has better performance than the state-of-the-art algorithms in terms of\nruntime, memory usage, and scalability. Furthermore, the enhanced algorithms\noutperform the non-optimized HIMU approach.\n","authors":"Wensheng Gan|Jerry Chun-Wei Lin|Philippe Fournier-Viger|Han-Chieh Chao|Philip S Yu","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.09584v1","link_pdf":"http://arxiv.org/pdf/1902.09584v1","link_doi":"","comment":"Under review in ACM Trans. on Data Science, 31 pages","journal_ref":"","doi":"","primary_category":"cs.DB","categories":"cs.DB"} {"id":"1902.10130v1","submitted":"2019-02-26 01:13:46","updated":"2019-02-26 01:13:46","title":"A Survey on Graph Processing Accelerators: Challenges and Opportunities","abstract":" The graph is a well-known data structure for representing the associated\nrelationships in a variety of applications, e.g., data science and machine\nlearning. Despite a wealth of existing efforts on developing graph processing\nsystems for improving the performance and/or energy efficiency on traditional\narchitectures, dedicated hardware solutions, also referred to as graph\nprocessing accelerators, are essential and emerging to provide benefits\nsignificantly beyond what pure software solutions can offer. 
In this paper, we\nconduct a systematic survey of the design and implementation of graph\nprocessing accelerators. Specifically, we review the relevant techniques in\nthree core components toward a graph processing accelerator: preprocessing,\nparallel graph computation and runtime scheduling. We also examine the\nbenchmarks and results in existing studies for evaluating a graph processing\naccelerator. Interestingly, we find that there is no absolute winner across\nall three aspects of graph acceleration, due to the diverse characteristics of\ngraph processing and the complexity of hardware configurations. We finally\ndiscuss several challenges in detail and explore opportunities for future\nresearch.\n","authors":"Chuangyi Gui|Long Zheng|Bingsheng He|Cheng Liu|Xinyu Chen|Xiaofei Liao|Hai Jin","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.10130v1","link_pdf":"http://arxiv.org/pdf/1902.10130v1","link_doi":"","comment":"This article has been accepted by Journal of Computer Science and\n Technology","journal_ref":"","doi":"","primary_category":"cs.DC","categories":"cs.DC"} {"id":"1902.10842v3","submitted":"2019-02-28 00:09:10","updated":"2019-06-03 17:13:22","title":"Online Sparse Subspace Clustering","abstract":" This paper focuses on the sparse subspace clustering problem, and develops an\nonline algorithmic solution to cluster data points on-the-fly, without\nrevisiting the whole dataset. The strategy involves an online solution of a\nsparse representation (SR) problem to build a (sparse) dictionary of\nsimilarities where points in the same subspace are considered \"similar,\"\nfollowed by a spectral clustering based on the obtained similarity matrix. When\nthe SR cost is strongly convex, the online solution converges to within a\nneighborhood of the optimal time-varying batch solution. A dynamic regret\nanalysis is performed when the SR cost is not strongly convex.\n","authors":"Liam Madden|Stephen Becker|Emiliano Dall'Anese","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.10842v3","link_pdf":"http://arxiv.org/pdf/1902.10842v3","link_doi":"http://dx.doi.org/10.1109/dsw.2019.8755556","comment":"4 pages, 4 figures. Copyright 2019 IEEE. Published in the 2019 IEEE\n Data Science Workshop (DSW 2019), scheduled for June 4-6, 2019 in\n Minneapolis, Minnesota","journal_ref":"","doi":"10.1109/dsw.2019.8755556","primary_category":"math.OC","categories":"math.OC"} {"id":"1902.11095v1","submitted":"2019-02-28 14:28:30","updated":"2019-02-28 14:28:30","title":"Big Data for Traffic Monitoring and Management","abstract":" The last two decades witnessed tremendous advances in Information and\nCommunications Technologies. Besides improvements in computational power and\nstorage capacity, communication networks nowadays carry an amount of data that\nwas not envisaged only a few years ago. Together with their pervasiveness,\nnetwork complexity increased at the same pace, leaving operators and\nresearchers with few instruments to understand what happens in the networks,\nand, on the global scale, on the Internet. Fortunately, recent advances in data\nscience and machine learning come to the rescue of network analysts, and allow\nanalyses with a level of complexity and spatial/temporal scope not possible\nonly 10 years ago. In my thesis, I take the perspective of an Internet Service\nProvider (ISP), and illustrate challenges and possibilities of analyzing the\ntraffic coming from modern operational networks. 
I make use of big data and\nmachine learning algorithms, and apply them to datasets coming from passive\nmeasurements of ISP and University Campus networks. The marriage between data\nscience and network measurements is complicated by the complexity of machine\nlearning algorithms, and by the intrinsic multi-dimensionality and variability\nof this kind of data. As such, my work proposes and evaluates novel techniques,\ninspired from popular machine learning approaches, but carefully tailored to\noperate with network traffic.\n","authors":"Martino Trevisan","affiliations":"","link_abstract":"http://arxiv.org/abs/1902.11095v1","link_pdf":"http://arxiv.org/pdf/1902.11095v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.NI","categories":"cs.NI"} {"id":"1903.01829v1","submitted":"2019-03-03 02:55:26","updated":"2019-03-03 02:55:26","title":"Comparison of plotting system outputs in beginner analysts","abstract":" The R programming language is built on an ecosystem of packages, some that\nallow analysts to accomplish the same tasks. For example, there are at least\ntwo clear workflows for creating data visualizations in R: using the base\ngraphics package (referred to as \"base R\") and the ggplot2 add-on package based\non the grammar of graphics. Here we perform an empirical study of the quality\nof scientific graphics produced by beginning R users. In our experiment,\nlearners taking a data science course on the Coursera platform were randomized\nto complete identical plotting exercises in either the base R or the ggplot2\nsystem. Learners were then asked to evaluate their peers in terms of visual\ncharacteristics key to scientific cognition. We observed that graphics created\nwith the two systems rated similarly on many characteristics. However, ggplot2\ngraphics were generally judged to be more visually pleasing and, in the case of\nfaceted scientific plots, easier to understand. Our results suggest that while\nboth graphic systems are useful in the hands of beginning users, ggplot2's\nnatural faceting system may be easier to use by beginning users for displaying\nmore complex relationships.\n","authors":"Leslie Myint|Aboozar Hadavand|Leah Jager|Jeffrey Leek","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.01829v1","link_pdf":"http://arxiv.org/pdf/1903.01829v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT"} {"id":"1903.01678v1","submitted":"2019-03-05 05:31:12","updated":"2019-03-05 05:31:12","title":"Two-Stream Multi-Channel Convolutional Neural Network (TM-CNN) for\n Multi-Lane Traffic Speed Prediction Considering Traffic Volume Impact","abstract":" Traffic speed prediction is a critically important component of intelligent\ntransportation systems (ITS). Recently, with the rapid development of deep\nlearning and transportation data science, a growing body of new traffic speed\nprediction models have been designed, which achieved high accuracy and\nlarge-scale prediction. However, existing studies have two major limitations.\nFirst, they predict aggregated traffic speed rather than lane-level traffic\nspeed; second, most studies ignore the impact of other traffic flow parameters\nin speed prediction. To address these issues, we propose a two-stream\nmulti-channel convolutional neural network (TM-CNN) model for multi-lane\ntraffic speed prediction considering traffic volume impact. 
In this model, we\nfirst introduce a new data conversion method that converts raw traffic speed\ndata and volume data into spatial-temporal multi-channel matrices. Then we\ncarefully design a two-stream deep neural network to effectively learn the\nfeatures and correlations between individual lanes, in the spatial-temporal\ndimensions, and between speed and volume. Accordingly, a new loss function that\nconsiders the volume impact in speed prediction is developed. A case study\nusing one-year data validates the TM-CNN model and demonstrates its\nsuperiority. This paper contributes to two research areas: (1) traffic speed\nprediction, and (2) multi-lane traffic flow study.\n","authors":"Ruimin Ke|Wan Li|Zhiyong Cui|Yinhai Wang","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.01678v1","link_pdf":"http://arxiv.org/pdf/1903.01678v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1903.01682v2","submitted":"2019-03-05 05:50:32","updated":"2019-08-07 17:54:13","title":"A linear-time algorithm and analysis of graph Relative Hausdorff\n distance","abstract":" Graph similarity metrics serve far-ranging purposes across many domains in\ndata science. As graph datasets grow in size, scientists need comparative tools\nthat capture meaningful differences, yet are lightweight and scalable. Graph\nRelative Hausdorff (RH) distance is a promising, recently proposed measure for\nquantifying degree distribution similarity. In spite of recent interest in RH\ndistance, little is known about its properties. Here, we conduct an algorithmic\nand analytic study of RH distance. In particular, we provide the first\nlinear-time algorithm for computing RH distance, analyze examples of RH\ndistance between pairs of real-world networks as well as structured families of\ngraphs, and prove several analytic results concerning the range, density, and\nextremal behavior of RH distance values.\n","authors":"Sinan G. Aksoy|Kathleen E. Nowak|Stephen J. Young","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.01682v2","link_pdf":"http://arxiv.org/pdf/1903.01682v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.CO","categories":"math.CO|cs.DM|cs.DS"} {"id":"1903.01734v2","submitted":"2019-03-05 09:00:58","updated":"2019-08-30 09:32:37","title":"A Novel Efficient Approach with Data-Adaptive Capability for OMP-based\n Sparse Subspace Clustering","abstract":" Orthogonal Matching Pursuit (OMP) plays an important role in data science and\nits applications such as sparse subspace clustering and image processing.\nHowever, existing OMP-based approaches lack data adaptiveness, so the data\ncannot be represented well enough and accuracy may be lost. This\npaper proposes a novel approach to enhance the data-adaptive capability for\nOMP-based sparse subspace clustering. In our method, a parameter selection\nprocess is developed to adjust the parameters based on the data distribution\nfor information representation. Our theoretical analysis indicates that the\nparameter selection process can efficiently coordinate with any OMP-based\nmethod to improve the clustering performance. Also, a new\nSelf-Expressive-Affinity (SEA) ratio metric is defined to measure the sparse\nrepresentation conversion efficiency for spectral clustering to obtain data\nsegmentations. 
Our experiments show that the proposed approach achieves better\nperformance than other OMP-based sparse subspace clustering\nalgorithms in terms of clustering accuracy, SEA ratio and representation\nquality, while maintaining time efficiency and robustness to noise.\n","authors":"Jiaqiyu Zhan|Zhiqiang Bai|Yuesheng Zhu","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.01734v2","link_pdf":"http://arxiv.org/pdf/1903.01734v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1903.01913v1","submitted":"2019-03-05 16:02:37","updated":"2019-03-05 16:02:37","title":"Revealing essential dynamics from high-dimensional fluid flow data and\n operators","abstract":" We consider concepts centered around modal analysis, data science, network\nscience, and machine learning to reveal the essential dynamics from\nhigh-dimensional fluid flow data and operators. The presentation of the\nmaterial herein is example-based and follows the author's keynote talk at the\n32nd Computational Fluid Dynamics Symposium (Japan Society of Fluid Mechanics,\nTokyo, December 11-13, 2018). This talk was delivered as a compilation of some\nof the research activities undertaken by the author's research group.\n","authors":"Kunihiko Taira","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.01913v1","link_pdf":"http://arxiv.org/pdf/1903.01913v1","link_doi":"","comment":"10 pages, 8 figures","journal_ref":"","doi":"","primary_category":"physics.flu-dyn","categories":"physics.flu-dyn"} {"id":"1903.03225v1","submitted":"2019-03-08 00:25:42","updated":"2019-03-08 00:25:42","title":"RAMSES II - RAMan Search for Extragalactic Symbiotic Stars. Project\n concept, commissioning, and early results from the science verification phase","abstract":" Symbiotic stars (SySts) are long-period interacting binaries composed of a\nhot compact star, an evolved giant star, and a tangled network of gas and dust\nnebulae. They represent unique laboratories for studying a variety of important\nastrophysical problems, and have also been proposed as possible progenitors of\nSNIa. Presently, we know 257 SySts in the Milky Way and 69 in external\ngalaxies. However, these numbers are still in striking contrast with the\npredicted population of SySts in our Galaxy. Because of other astrophysical\nsources that mimic SySt colors, no photometric diagnostic tool has so far\ndemonstrated the power to unambiguously identify a SySt, thus making the\nrecourse to costly spectroscopic follow-up still inescapable. In this paper we\npresent the concept, commissioning, and science verification phases, as well as\nthe first scientific results, of RAMSES II - a Gemini Observatory Instrument\nUpgrade Project that has provided each GMOS instrument at both Gemini\ntelescopes with a set of narrow-band filters centered on the Raman OVI 6830 A\nband. Continuum-subtracted images using these new filters clearly revealed\nknown SySts with a range of Raman OVI line strengths, even in crowded fields.\nRAMSES II observations also produced the first detection of Raman OVI emission\nfrom the SySt LMC 1 and confirmed Hen 3-1768 as a new SySt - the first\nphotometric confirmation of a SySt. Via Raman OVI narrow-band imaging, RAMSES\nII provides the astronomical community with the first purely photometric tool\nfor hunting SySts in the local Universe.\n","authors":"R. Angeloni|D. R. Gonçalves|S. Akras|G. Gimeno|R. Diaz|J. Scharwächter|N. E. Nuñez|G. J. M. Luna|H. W. Lee|J. E. Heo|A. B. Lucy|M. 
Jaque Arancibia|C. Moreno|E. Chirre|S. J. Goodsell|P. Soto King|J. L. Sokoloski|B. E. Choi|M. Dias Ribeiro","affiliations":"Instituto de Investigación Multidisciplinar en Ciencia y Tecnología, Universidad de La Serena, La Serena, Chile|Observatório do Valongo, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil|Observatório do Valongo, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil|Gemini Observatory, Southern Operations Center, La Serena, Chile|Gemini Observatory, Southern Operations Center, La Serena, Chile|Gemini Observatory, Northern Operations Center, Hilo, HI, USA|ICATE-CONICET, San Juan, Argentina|IAFE-CONICET, Buenos Aires, Argentina|Department of Physics and Astronomy, Sejong University, Seoul, Republic of Korea|Department of Physics and Astronomy, Sejong University, Seoul, Republic of Korea|Columbia University, Dept. of Astronomy, New York, NY, USA|Departamento de Física y Astronomía, Universidad de La Serena, La Serena, Chile|Gemini Observatory, Southern Operations Center, La Serena, Chile|Gemini Observatory, Southern Operations Center, La Serena, Chile|Gemini Observatory, Northern Operations Center, Hilo, HI, USA|Departamento de Física y Astronomía, Universidad de La Serena, La Serena, Chile|Columbia University, Dept. of Astronomy, New York, NY, USA|Department of Physics and Astronomy, Sejong University, Seoul, Republic of Korea|Observatório do Valongo, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil","link_abstract":"http://arxiv.org/abs/1903.03225v1","link_pdf":"http://arxiv.org/pdf/1903.03225v1","link_doi":"http://dx.doi.org/10.3847/1538-3881/ab0cf7","comment":"23 pages, 16 figures, 5 tables, accepted for publication in AJ","journal_ref":"","doi":"10.3847/1538-3881/ab0cf7","primary_category":"astro-ph.SR","categories":"astro-ph.SR|astro-ph.IM"} {"id":"1903.03254v1","submitted":"2019-03-08 02:32:55","updated":"2019-03-08 02:32:55","title":"An Algorithm for the Visualization of Relevant Patterns in Astronomical\n Light Curves","abstract":" Within the last years, the classification of variable stars with Machine\nLearning has become a mainstream area of research. Recently, visualization of\ntime series is attracting more attention in data science as a tool to visually\nhelp scientists to recognize significant patterns in complex dynamics. Within\nthe Machine Learning literature, dictionary-based methods have been widely used\nto encode relevant parts of image data. These methods intrinsically assign a\ndegree of importance to patches in pictures, according to their contribution in\nthe image reconstruction. Inspired by dictionary-based techniques, we present\nan approach that naturally provides the visualization of salient parts in\nastronomical light curves, making the analogy between image patches and\nrelevant pieces in time series. Our approach encodes the most meaningful\npatterns such that we can approximately reconstruct light curves by just using\nthe encoded information. We test our method in light curves from the OGLE-III\nand StarLight databases. Our results show that the proposed model delivers an\nautomatic and intuitive visualization of relevant light curve parts, such as\nlocal peaks and drops in magnitude.\n","authors":"Christian Pieringer|Karim Pichara|Márcio Catelán|Pavlos Protopapas","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.03254v1","link_pdf":"http://arxiv.org/pdf/1903.03254v1","link_doi":"http://dx.doi.org/10.1093/mnras/stz106","comment":"Accepted 2019 January 8. 
Received 2019 January 8; in original form\n 2018 January 29. 7 pages, 6 figures","journal_ref":"Monthly Notices of the Royal Astronomical Society, MNRAS 484,\n 3071-3077 (2019)","doi":"10.1093/mnras/stz106","primary_category":"astro-ph.IM","categories":"astro-ph.IM|cs.LG|85-08"} {"id":"1903.04479v2","submitted":"2019-03-11 17:56:13","updated":"2020-06-18 16:15:01","title":"Revisiting clustering as matrix factorisation on the Stiefel manifold","abstract":" This paper studies clustering for possibly high dimensional data (e.g.\nimages, time series, gene expression data, and many other settings), and\nrephrases it as low rank matrix estimation in the PAC-Bayesian framework. Our\napproach leverages the well known Burer-Monteiro factorisation strategy from\nlarge scale optimisation, in the context of low rank estimation. Moreover, our\nBurer-Monteiro factors are shown to lie on a Stiefel manifold. We propose a new\ngeneralized Bayesian estimator for this problem and prove novel prediction\nbounds for clustering. We also devise a componentwise Langevin sampler on the\nStiefel manifold to compute this estimator.\n","authors":"Stéphane Chrétien|Benjamin Guedj","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.04479v2","link_pdf":"http://arxiv.org/pdf/1903.04479v2","link_doi":"","comment":"Accepted at the LOD 2020 Conference -- The Sixth International\n Conference on Machine Learning, Optimization, and Data Science","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1903.04772v1","submitted":"2019-03-12 08:04:44","updated":"2019-03-12 08:04:44","title":"Paradox in Deep Neural Networks: Similar yet Different while Different\n yet Similar","abstract":" Machine learning is advancing towards a data-science approach, implying the\nnecessity of a line of investigation to divulge the knowledge learnt by deep\nneuronal networks. Limiting the comparison among networks merely to a\npredefined intelligent ability, according to ground truth, does not suffice; it\nshould be associated with the innate similarity of these artificial entities. Here,\nwe analysed multiple instances of an identical architecture trained to classify\nobjects in static images (CIFAR and ImageNet data sets). We evaluated the\nperformance of the networks under various distortions and compared it to the\nintrinsic similarity between their constituent kernels. While we expected a\nclose correspondence between these two measures, we observed a puzzling\nphenomenon. Pairs of networks whose kernels' weights are over 99.9% correlated\ncan exhibit significantly different performances, yet other pairs with no\ncorrelation can reach quite compatible levels of performance. We show\nimplications of this for transfer learning, and argue its importance in our\ngeneral understanding of what intelligence is, whether natural or artificial.\n","authors":"Arash Akbarinia|Karl R. Gegenfurtner","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.04772v1","link_pdf":"http://arxiv.org/pdf/1903.04772v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.CV","categories":"cs.CV"} {"id":"1903.05750v2","submitted":"2019-03-13 23:06:16","updated":"2019-07-26 16:07:33","title":"Modal Analysis of Fluid Flows: Applications and Outlook","abstract":" We present applications of modal analysis techniques to study, model, and\ncontrol canonical aerodynamic flows. 
To illustrate how modal analysis\ntechniques can provide physical insights in a complementary manner, we selected\nfour fundamental examples of cylinder wakes, wall-bounded flows, airfoil wakes,\nand cavity flows. We also offer brief discussions on the outlook for modal\nanalysis techniques, in light of rapid developments in data science.\n","authors":"Kunihiko Taira|Maziar S. Hemati|Steven L. Brunton|Yiyang Sun|Karthik Duraisamy|Shervin Bagheri|Scott T. M. Dawson|Chi-An Yeh","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.05750v2","link_pdf":"http://arxiv.org/pdf/1903.05750v2","link_doi":"","comment":"37 pages, 19 figures, 2 tables","journal_ref":"","doi":"","primary_category":"physics.flu-dyn","categories":"physics.flu-dyn"} {"id":"1903.06469v3","submitted":"2019-03-15 11:22:56","updated":"2019-08-06 16:17:34","title":"Using Data Science to Understand the Film Industry's Gender Gap","abstract":" Data science can offer answers to a wide range of social science questions.\nHere we turn attention to the portrayal of women in movies, an industry that\nhas a significant influence on society, impacting such aspects of life as\nself-esteem and career choice. To this end, we fused data from the online movie\ndatabase IMDb with a dataset of movie dialogue subtitles to create the largest\navailable corpus of movie social networks (15,540 networks). Analyzing this\ndata, we investigated gender bias in on-screen female characters over the past\ncentury. We find a trend of improvement in all aspects of women's roles in\nmovies, including a constant rise in the centrality of female characters. There\nhas also been an increase in the number of movies that pass the well-known\nBechdel test, a popular--albeit flawed--measure of women in fiction. Here we\npropose a new and better alternative to this test for evaluating female roles\nin movies. Our study introduces fresh data, an open-code framework, and novel\ntechniques that present new opportunities in the research and analysis of\nmovies.\n","authors":"Dima Kagan|Thomas Chesney|Michael Fire","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.06469v3","link_pdf":"http://arxiv.org/pdf/1903.06469v3","link_doi":"http://dx.doi.org/10.1057/s41599-020-0436-1","comment":"","journal_ref":"Palgrave Commun 6, 92 (2020)","doi":"10.1057/s41599-020-0436-1","primary_category":"cs.SI","categories":"cs.SI|cs.CY|physics.data-an"} {"id":"1903.08381v1","submitted":"2019-03-20 08:28:16","updated":"2019-03-20 08:28:16","title":"The Promise of Data Science for the Technosignatures Field","abstract":" This paper outlines some of the possible advancements for the\ntechnosignatures searches using the new methods currently developing rapidly in\ncomputer science, such as machine learning and deep learning. It also showcases\na couple of case studies of large research programs where such methods have\nalready been successfully implemented with notable results. We consider that\nthe availability of data from all-sky, all-the-time observations paired with\nthe latest developments in computational capabilities and algorithms currently\nused in artificial intelligence, including automation, will spur an\nunprecedented development of the technosignatures search efforts.\n","authors":"Anamaria Berea|Steve Croft|Daniel Angerhausen","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.08381v1","link_pdf":"http://arxiv.org/pdf/1903.08381v1","link_doi":"","comment":"Science white paper submitted in response to the U.S. 
National\n Academies of Science, Engineering, and Medicine's call for community input to\n the Astro2020 Decadal Survey; 7 pages, 1 figure","journal_ref":"","doi":"","primary_category":"astro-ph.IM","categories":"astro-ph.IM|astro-ph.EP|physics.data-an"} {"id":"1903.08426v1","submitted":"2019-03-20 10:34:44","updated":"2019-03-20 10:34:44","title":"Comparison of Multi-response Prediction Methods","abstract":" While data science is battling to extract information from the enormous\nexplosion of data, many estimators and algorithms are being developed for\nbetter prediction. Researchers and data scientists often introduce new methods\nand evaluate them based on various aspects of data. However, studies on the\nimpact of/on a model with multiple response variables are limited. This study\ncompares some newly-developed (envelope) and well-established (PLS, PCR)\nprediction methods based on real data and simulated data specifically designed\nby varying properties such as multicollinearity, the correlation between\nmultiple responses and position of relevant principal components of predictors.\nThis study aims to give some insight into these methods and help the researcher\nto understand and use them in further studies.\n","authors":"Raju Rimal|Trygve Almøy|Solve Sæbø","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.08426v1","link_pdf":"http://arxiv.org/pdf/1903.08426v1","link_doi":"http://dx.doi.org/10.1016/j.chemolab.2019.05.004","comment":"22 pages, 13 figures","journal_ref":"","doi":"10.1016/j.chemolab.2019.05.004","primary_category":"stat.AP","categories":"stat.AP"} {"id":"1903.08760v1","submitted":"2019-03-20 22:02:10","updated":"2019-03-20 22:02:10","title":"Nucleon Femtography from Exclusive Reactions","abstract":" Major breakthroughs over the last two decades have led us to access\ninformation on how the nucleon's mass, spin and mechanical properties are\ngenerated from its quark and gluon degrees of freedom. On one side, a\ntheoretical framework has been developed which enables the extraction of 3D\nparton distributions from deeply virtual exclusive scattering experiments. On\nthe other hand, the so called gravitomagnetic form factors parameterizing the\nQCD energy momentum tensor of the nucleon have been connected to the Mellin\nmoments of the 3D parton distributions. Current efforts in both experiment and\ntheory are being directed at using information from electron scattering\nexperiments to map out and eventually visualize these 3D distributions as well\nas the mass, spin and mechanical properties of the nucleon and of spin 0, 1/2,\nand 1 nuclei. A new science of nucleon femtography is emerging where the 3D\nstructure of the nucleon will be studied merging information from current and\nforthcoming data from various facilities including the future EIC and the new\ninflux of data science, imaging, and visualization.\n","authors":"Simonetta Liuti","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.08760v1","link_pdf":"http://arxiv.org/pdf/1903.08760v1","link_doi":"","comment":"12 pages, 4 figures, Plenary Talk at \"SPIN 2018, Ferrara, Italy,\n September 10-14, 2018\"","journal_ref":"","doi":"","primary_category":"hep-ph","categories":"hep-ph"} {"id":"1904.05326v1","submitted":"2019-03-22 17:38:04","updated":"2019-03-22 17:38:04","title":"Tending Unmarked Graves: Classification of Post-mortem Content on Social\n Media","abstract":" User-generated content is central to social computing scholarship. 
However,\nresearchers and practitioners often presume that these users are alive. Failing\nto account for mortality is problematic in social media where an increasing\nnumber of profiles represent those who have died. Identifying mortality can\nempower designers to better manage content and support the bereaved, as well as\npromote high-quality data science. Based on a computational linguistic analysis\nof post-mortem social media profiles and content, we report on classifiers\ndeveloped to detect mortality and show that mortality can be determined after\nthe first few occurrences of post-mortem content. Applying our classifiers to\ncontent from two other platforms also provided good results. Finally, we\ndiscuss trade-offs between models that emphasize pre- vs. post-mortem precision\nin this sensitive context. These results mark a first step toward identifying\nmortality at scale, and show how designers and scientists can attend to\nmortality in their work.\n","authors":"Jialun \"Aaron\" Jiang|Jed R. Brubaker","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.05326v1","link_pdf":"http://arxiv.org/pdf/1904.05326v1","link_doi":"http://dx.doi.org/10.1145/3274350","comment":"","journal_ref":"Proc. ACM Hum.-Comput. Interact. 2, CSCW: Article 81 (2018)","doi":"10.1145/3274350","primary_category":"cs.SI","categories":"cs.SI|cs.CY"} {"id":"1903.11241v1","submitted":"2019-03-26 13:27:37","updated":"2019-03-26 13:27:37","title":"Data Science and Digital Systems: The 3Ds of Machine Learning Systems\n Design","abstract":" Machine learning solutions, in particular those based on deep learning\nmethods, form an underpinning of the current revolution in \"artificial\nintelligence\" that has dominated popular press headlines and is having a\nsignificant influence on the wider tech agenda. Here we give an overview of the\n3Ds of ML systems design: Data, Design and Deployment. By considering the 3Ds\nwe can move towards \\emph{data first} design.\n","authors":"Neil D. Lawrence","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.11241v1","link_pdf":"http://arxiv.org/pdf/1903.11241v1","link_doi":"","comment":"Paper presented at the Stu Hunter Research Conference held at the\n Villa Porro Pirelli in Induno Olona, Italy, from Sunday February 17th to\n Wednesday February 20th, 2019","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.AI"} {"id":"1903.11406v3","submitted":"2019-03-27 13:09:16","updated":"2020-05-14 19:58:51","title":"Analyzing Knowledge Graph Embedding Methods from a Multi-Embedding\n Interaction Perspective","abstract":" Knowledge graph is a popular format for representing knowledge, with many\napplications to semantic search engines, question-answering systems, and\nrecommender systems. Real-world knowledge graphs are usually incomplete, so\nknowledge graph embedding methods, such as Canonical decomposition/Parallel\nfactorization (CP), DistMult, and ComplEx, have been proposed to address this\nissue. These methods represent entities and relations as embedding vectors in\nsemantic space and predict the links between them. The embedding vectors\nthemselves contain rich semantic information and can be used in other\napplications such as data analysis. However, mechanisms in these models and the\nembedding vectors themselves vary greatly, making it difficult to understand\nand compare them. 
Given this lack of understanding, we risk using them\nineffectively or incorrectly, particularly for complicated models, such as CP,\nwith two role-based embedding vectors, or the state-of-the-art ComplEx model,\nwith complex-valued embedding vectors. In this paper, we propose a\nmulti-embedding interaction mechanism as a new approach to uniting and\ngeneralizing these models. We derive them theoretically via this mechanism and\nprovide empirical analyses and comparisons between them. We also propose a new\nmulti-embedding model based on quaternion algebra and show that it achieves\npromising results using popular benchmarks. Source code is available on github\nat https://github.com/tranhungnghiep/AnalyzingKGEmbeddings\n","authors":"Hung Nghiep Tran|Atsuhiro Takasu","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.11406v3","link_pdf":"http://arxiv.org/pdf/1903.11406v3","link_doi":"","comment":"DSI4 at EDBT/ICDT 2019. Source code is available on github at\n https://github.com/tranhungnghiep/AnalyzingKGEmbeddings","journal_ref":"Data Science for Industry 4.0 at EDBT/ICDT 2019","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.AI|stat.ML"} {"id":"1904.03990v1","submitted":"2019-03-27 14:36:19","updated":"2019-03-27 14:36:19","title":"Import2vec - Learning Embeddings for Software Libraries","abstract":" We consider the problem of developing suitable learning representations\n(embeddings) for library packages that capture semantic similarity among\nlibraries. Such representations are known to improve the performance of\ndownstream learning tasks (e.g. classification) or applications such as\ncontextual search and analogical reasoning.\n We apply word embedding techniques from natural language processing (NLP) to\ntrain embeddings for library packages (\"library vectors\"). Library vectors\nrepresent libraries by similar context of use as determined by import\nstatements present in source code. Experimental results obtained from training\nsuch embeddings on three large open source software corpora reveal that\nlibrary vectors capture semantically meaningful relationships among software\nlibraries, such as the relationship between frameworks and their plug-ins and\nlibraries commonly used together within ecosystems such as big data\ninfrastructure projects (in Java), front-end and back-end web development\nframeworks (in JavaScript) and data science toolkits (in Python).\n","authors":"Bart Theeten|Frederik Vandeputte|Tom Van Cutsem","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.03990v1","link_pdf":"http://arxiv.org/pdf/1904.03990v1","link_doi":"","comment":"MSR19 Conference 11 pages","journal_ref":"","doi":"","primary_category":"cs.SE","categories":"cs.SE|cs.IR|cs.LG|stat.ML"} {"id":"1904.05328v1","submitted":"2019-03-28 03:38:05","updated":"2019-03-28 03:38:05","title":"PropTech for Proactive Pricing of Houses in Classified Advertisements in\n the Indian Real Estate Market","abstract":" Property Technology (PropTech) is the next big thing that is going to disrupt\nthe real estate market. Nowadays, we see applications of Machine Learning (ML)\nand Artificial Intelligence (AI) in almost all domains, but for a long time\nthe real estate industry was quite slow in adopting data science and machine\nlearning for problem solving and improving its processes. However, things are\nchanging quite fast as we see a lot of adoption of AI and ML in the US and\nEuropean real estate markets. But the Indian real estate market has to catch up\na lot. 
This paper proposes a machine learning approach for solving the house\nprice prediction problem in classified advertisements. This study focuses\non the Indian real estate market. We apply advanced machine learning algorithms\nsuch as random forests, gradient boosting and artificial neural networks to a\nreal-world dataset and compare the performance of these methods. We find that\nthe random forest method is the best performer in terms of prediction accuracy.\n","authors":"Sayan Putatunda","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.05328v1","link_pdf":"http://arxiv.org/pdf/1904.05328v1","link_doi":"","comment":"8 pages, PoC paper","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI"} {"id":"1903.12157v1","submitted":"2019-03-28 17:43:09","updated":"2019-03-28 17:43:09","title":"Resilient Combination of Complementary CNN and RNN Features for Text\n Classification through Attention and Ensembling","abstract":" State-of-the-art methods for text classification include several distinct\nsteps of pre-processing, feature extraction and post-processing. In this work,\nwe focus on end-to-end neural architectures and show that the best performance\nin text classification is obtained by combining information from different\nneural modules. Concretely, we combine convolution, recurrent and attention\nmodules with ensemble methods and show that they are complementary. We\nintroduce ECGA, an end-to-end go-to architecture for novel text classification\ntasks. We prove that it is efficient and robust, as it attains or surpasses the\nstate-of-the-art on varied datasets, including both low and high data regimes.\n","authors":"Athanasios Giannakopoulos|Maxime Coriou|Andreea Hossmann|Michael Baeriswyl|Claudiu Musat","affiliations":"","link_abstract":"http://arxiv.org/abs/1903.12157v1","link_pdf":"http://arxiv.org/pdf/1903.12157v1","link_doi":"","comment":"5 pages, 1 figure, SDS 2019 - The 6th Swiss Conference on Data\n Science","journal_ref":"","doi":"","primary_category":"cs.CL","categories":"cs.CL"} {"id":"1904.00176v1","submitted":"2019-03-30 09:08:45","updated":"2019-03-30 09:08:45","title":"Nonparametric Density Estimation for High-Dimensional Data - Algorithms\n and Applications","abstract":" Density Estimation is one of the central areas of statistics whose purpose is\nto estimate the probability density function underlying the observed data. It\nserves as a building block for many tasks in statistical inference,\nvisualization, and machine learning. Density Estimation is widely adopted in\nthe domain of unsupervised learning, especially for the application of\nclustering. As big data become pervasive in almost every area of data sciences,\nanalyzing high-dimensional data that have many features and variables appears\nto be a major focus in both academia and industry. High-dimensional data pose\nchallenges not only from the theoretical aspects of statistical inference, but\nalso from the algorithmic/computational considerations of machine learning and\ndata analytics. This paper reviews a collection of selected nonparametric\ndensity estimation algorithms for high-dimensional data, some of which are\nrecently published and provide interesting mathematical insights. Important\napplication domains of nonparametric density estimation, such as modal\nclustering, are also included in this paper. Several research directions\nrelated to density estimation and high-dimensional data analysis are suggested\nby the authors.\n","authors":"Zhipeng Wang|David W. 
Scott","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.00176v1","link_pdf":"http://arxiv.org/pdf/1904.00176v1","link_doi":"http://dx.doi.org/10.1002/wics.1461","comment":"","journal_ref":"Wiley Interdisciplinary Reviews: Computational Statistics, 2019","doi":"10.1002/wics.1461","primary_category":"stat.ML","categories":"stat.ML|cs.LG|stat.CO"} {"id":"1904.00521v1","submitted":"2019-04-01 01:03:57","updated":"2019-04-01 01:03:57","title":"Adaptive Ensemble Learning of Spatiotemporal Processes with Calibrated\n Predictive Uncertainty: A Bayesian Nonparametric Approach","abstract":" Ensemble learning is a mainstay in modern data science practice. Conventional\nensemble algorithms assign to base models a set of deterministic, constant\nmodel weights that (1) do not fully account for individual models' varying\naccuracy across data subgroups, nor (2) provide uncertainty estimates for the\nensemble prediction. These shortcomings can yield predictions that are precise\nbut biased, which can negatively impact the performance of the algorithm in\nreal-world applications. In this work, we present an adaptive, probabilistic\napproach to ensemble learning using a transformed Gaussian process as a prior\nfor the ensemble weights. Given input features, our method optimally combines\nbase models based on their predictive accuracy in the feature space, and\nprovides interpretable estimates of the uncertainty associated with both model\nselection, as reflected by the ensemble weights, and the overall ensemble\npredictions. Furthermore, to ensure that this quantification of the model\nuncertainty is accurate, we propose additional machinery to non-parametrically\nmodel the ensemble's predictive cumulative density function (CDF) so that it is\nconsistent with the empirical distribution of the data. We apply the proposed\nmethod to data simulated from a nonlinear regression model, and to generate a\nspatial prediction model and associated prediction uncertainties for fine\nparticle levels in eastern Massachusetts, USA.\n","authors":"Jeremiah Zhe Liu|John Paisley|Marianthi-Anna Kioumourtzoglou|Brent A. Coull","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.00521v1","link_pdf":"http://arxiv.org/pdf/1904.00521v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ME","categories":"stat.ME"} {"id":"1904.02059v4","submitted":"2019-04-03 15:27:37","updated":"2020-08-04 01:57:32","title":"Tunable Eigenvector-Based Centralities for Multiplex and Temporal\n Networks","abstract":" Characterizing the importances (i.e., centralities) of nodes in social,\nbiological, and technological networks is a core topic in both network science\nand data science. We present a linear-algebraic framework that generalizes\neigenvector-based centralities, including PageRank and hub/authority scores, to\nprovide a common framework for two popular classes of multilayer networks:\nmultiplex networks (which have layers that encode different types of\nrelationships) and temporal networks (in which the relationships change over\ntime). Our approach involves the study of joint, marginal, and conditional\n"supracentralities" that one can calculate from the dominant eigenvector of a\nsupracentrality matrix [Taylor et al., 2017], which couples centrality matrices\nthat are associated with individual network layers. 
We extend this prior work\n(which was restricted to temporal networks with layers that are coupled by\nadjacent-in-time coupling) by allowing the layers to be coupled through a\n(possibly asymmetric) interlayer-adjacency matrix $\\tilde{{\\bf A}}$, where the\nentry $\\tilde{A}_{tt'} \\geq 0$ encodes the coupling between layers $t$ and\n$t'$. Our framework provides a unifying foundation for centrality analysis of\nmultiplex and temporal networks; it also illustrates a complicated dependency\nof the supracentralities on the topology and weights of interlayer coupling. By\nscaling $\\tilde{{\\bf A}}$ by an interlayer-coupling strength $\\omega\\ge0$ and\ndeveloping a singular perturbation theory for the limits of weak\n($\\omega\\to0^+$) and strong coupling ($\\omega\\to\\infty$), we also reveal an\ninteresting dependence of supracentralities on the dominant left and right\neigenvectors of $\\tilde{{\\bf A}}$.\n","authors":"Dane Taylor|Mason A. Porter|Peter J. Mucha","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.02059v4","link_pdf":"http://arxiv.org/pdf/1904.02059v4","link_doi":"","comment":"35 pages with 7 figures, followed by 5 pages of supporting material","journal_ref":"","doi":"","primary_category":"cs.SI","categories":"cs.SI|cs.NA|math.NA|physics.soc-ph"} {"id":"1904.03130v1","submitted":"2019-04-05 15:54:38","updated":"2019-04-05 15:54:38","title":"Unsupervised Low Latency Speech Enhancement with RT-GCC-NMF","abstract":" In this paper, we present RT-GCC-NMF: a real-time (RT), two-channel blind\nspeech enhancement algorithm that combines the non-negative matrix\nfactorization (NMF) dictionary learning algorithm with the generalized\ncross-correlation (GCC) spatial localization method. Using a pre-learned\nuniversal NMF dictionary, RT-GCC-NMF operates in a frame-by-frame fashion by\nassociating individual dictionary atoms to target speech or background\ninterference based on their estimated time-delay of arrivals (TDOA). We\nevaluate RT-GCC-NMF on two-channel mixtures of speech and real-world noise from\nthe Signal Separation and Evaluation Campaign (SiSEC). We demonstrate that this\napproach generalizes to new speakers, acoustic environments, and recording\nsetups from very little training data, and outperforms all but one of the\nalgorithms from the SiSEC challenge in terms of overall Perceptual Evaluation\nmethods for Audio Source Separation (PEASS) scores and compares favourably to\nthe ideal binary mask baseline. Over a wide range of input SNRs, we show that\nthis approach simultaneously improves the PEASS and signal to noise ratio\n(SNR)-based Blind Source Separation (BSS) Eval objective quality metrics as\nwell as the short-time objective intelligibility (STOI) and extended STOI\n(ESTOI) objective speech intelligibility metrics. A flexible, soft masking\nfunction in the space of NMF activation coefficients offers real-time control\nof the trade-off between interference suppression and target speaker fidelity.\nFinally, we use an asymmetric short-time Fourier transform (STFT) to reduce the\ninherent algorithmic latency of RT-GCC-NMF from 64 ms to 2 ms with no loss in\nperformance. We demonstrate that latencies within the tolerable range for\nhearing aids are possible on current hardware platforms.\n","authors":"Sean U. N. 
Wood|Jean Rouat","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.03130v1","link_pdf":"http://arxiv.org/pdf/1904.03130v1","link_doi":"http://dx.doi.org/10.1109/JSTSP.2019.2909193","comment":"Accepted for publication in the IEEE JSTSP Special Issue on Data\n Science: Machine Learning for Audio Signal Processing","journal_ref":"","doi":"10.1109/JSTSP.2019.2909193","primary_category":"eess.AS","categories":"eess.AS|cs.SD"} {"id":"1904.03160v1","submitted":"2019-04-05 17:04:57","updated":"2019-04-05 17:04:57","title":"Discrete Fourier Transform Improves the Prediction of the Electronic\n Properties of Molecules in Quantum Machine Learning","abstract":" High-throughput approximations of quantum mechanics calculations and\ncombinatorial experiments have been traditionally used to reduce the search\nspace of possible molecules, drugs and materials. However, the interplay of\nstructural and chemical degrees of freedom introduces enormous complexity,\nwhich the current state-of-the-art tools are not yet designed to handle. The\navailability of large molecular databases generated by quantum mechanics (QM)\ncomputations using first principles open new venues for data science to\naccelerate the discovery of new compounds. In recent years, models that combine\nQM with machine learning (ML) known as QM/ML models have been successful at\ndelivering the accuracy of QM at the speed of ML. The goals are to develop a\nframework that will accelerate the extraction of knowledge and to get insights\nfrom quantitative process-structure-property-performance relationships hidden\nin materials data via a better search of the chemical compound space, and to\ninfer new materials with targeted properties. In this study, we show that by\nintegrating well-known signal processing techniques such as discrete Fourier\ntransform in the QM/ML pipeline, the outcomes can be significantly improved in\nsome cases. We also show that the spectrogram of a molecule may represent an\ninteresting molecular visualization tool.\n","authors":"Alain Tchagang|Julio Valdés","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.03160v1","link_pdf":"http://arxiv.org/pdf/1904.03160v1","link_doi":"http://dx.doi.org/10.1109/CCECE.2019.8861895","comment":"4 pages, 3 figures, 2 tables. Accepted to present at 32nd IEEE\n Canadian Conference in Electrical Engineering and Computer Science","journal_ref":"2019 IEEE Canadian Conference of Electrical and Computer\n Engineering (CCECE)","doi":"10.1109/CCECE.2019.8861895","primary_category":"quant-ph","categories":"quant-ph|cond-mat.mtrl-sci|physics.comp-ph"} {"id":"1904.03766v5","submitted":"2019-04-07 23:04:37","updated":"2020-02-24 02:20:50","title":"Generalized Persistence Algorithm for Decomposing Multi-parameter\n Persistence Modules","abstract":" The classical persistence algorithm virtually computes the unique\ndecomposition of a persistence module implicitly given by an input simplicial\nfiltration. Based on matrix reduction, this algorithm is a cornerstone of the\nemergent area of topological data analysis. Its input is a simplicial\nfiltration defined over integers $\\mathbb{Z}$ giving rise to a $1$-parameter\npersistence module. It has been recognized that multi-parameter version of\npersistence modules given by simplicial filtrations over $d$-dimensional\ninteger grids $\\mathbb{Z}^d$ is equally or perhaps more important in data\nscience applications. 
However, in the multi-parameter setting, one of the main\nbottlenecks is that topological summaries such as barcodes and distances among\nthem cannot be as efficiently computed as in the $1$-parameter case because\nthere is no known generalization of the persistence algorithm for computing the\ndecomposition of multi-parameter persistence modules. The Meataxe algorithm, a\npopular one known for computing such a decomposition runs in\n$\\tilde{O}(n^{6(d+1)})$ time where $n$ is the size of the input filtration. We\npresent for the first time a generalization of the persistence algorithm based\non a generalized matrix reduction technique that runs in $O(n^{2\\omega+1})$\ntime for $d=2$ and in $O(n^{d(2\\omega +1)})$ time for $d>2$ where\n$\\omega<2.373$ is the exponent for matrix multiplication. Various structural\nand computational results connecting the graded modules from commutative\nalgebra to matrix reductions are established through the course.\n","authors":"Tamal K. Dey|Cheng Xin","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.03766v5","link_pdf":"http://arxiv.org/pdf/1904.03766v5","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.AT","categories":"math.AT|cs.CG"} {"id":"1904.04192v1","submitted":"2019-04-08 17:12:29","updated":"2019-04-08 17:12:29","title":"The fragility of decentralised trustless socio-technical systems","abstract":" The blockchain technology promises to transform finance, money and even\ngovernments. However, analyses of blockchain applicability and robustness\ntypically focus on isolated systems whose actors contribute mainly by running\nthe consensus algorithm. Here, we highlight the importance of considering\ntrustless platforms within the broader ecosystem that includes social and\ncommunication networks. As an example, we analyse the flash-crash observed on\n21st June 2017 in the Ethereum platform and show that a major phenomenon of\nsocial coordination led to a catastrophic cascade of events across several\ninterconnected systems. We propose the concept of ``emergent centralisation''\nto describe situations where a single system becomes critically important for\nthe functioning of the whole ecosystem, and argue that such situations are\nlikely to become more and more frequent in interconnected socio-technical\nsystems. We anticipate that the systemic approach we propose will have\nimplications for future assessments of trustless systems and call for the\nattention of policy-makers on the fragility of our interconnected and rapidly\nchanging world.\n","authors":"Manlio De Domenico|Andrea Baronchelli","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.04192v1","link_pdf":"http://arxiv.org/pdf/1904.04192v1","link_doi":"http://dx.doi.org/10.1140/epjds/s13688-018-0180-6","comment":"Commentary published in EPJ Data Science","journal_ref":"EPJ Data Science 8:2 (2019)","doi":"10.1140/epjds/s13688-018-0180-6","primary_category":"physics.soc-ph","categories":"physics.soc-ph|cs.SI|q-fin.TR"} {"id":"1904.04912v2","submitted":"2019-04-09 21:06:55","updated":"2019-11-17 21:16:45","title":"Enhancing Time Series Momentum Strategies Using Deep Neural Networks","abstract":" While time series momentum is a well-studied phenomenon in finance, common\nstrategies require the explicit definition of both a trend estimator and a\nposition sizing rule. In this paper, we introduce Deep Momentum Networks -- a\nhybrid approach which injects deep learning based trading rules into the\nvolatility scaling framework of time series momentum. 
The model also\nsimultaneously learns both trend estimation and position sizing in a\ndata-driven manner, with networks directly trained by optimising the Sharpe\nratio of the signal. Backtesting on a portfolio of 88 continuous futures\ncontracts, we demonstrate that the Sharpe-optimised LSTM improved on traditional\nmethods by more than a factor of two in the absence of transaction costs, and\ncontinues to outperform them for transaction costs of up to 2-3 basis\npoints. To account for more illiquid assets, we also propose a turnover\nregularisation term which trains the network to factor in costs at run-time.\n","authors":"Bryan Lim|Stefan Zohren|Stephen Roberts","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.04912v2","link_pdf":"http://arxiv.org/pdf/1904.04912v2","link_doi":"","comment":"","journal_ref":"The Journal of Financial Data Science, Fall 2019","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG|q-fin.TR"} {"id":"1904.05619v2","submitted":"2019-04-11 10:50:33","updated":"2019-04-13 12:47:55","title":"A Stochastic LBFGS Algorithm for Radio Interferometric Calibration","abstract":" We present a stochastic, limited-memory Broyden Fletcher Goldfarb Shanno\n(LBFGS) algorithm that is suitable for handling very large amounts of data. A\ndirect application of this algorithm is radio interferometric calibration of\nraw data at fine time and frequency resolution. Almost all existing radio\ninterferometric calibration algorithms assume that it is possible to fit the\ndataset being calibrated into memory. Therefore, the raw data is averaged in\ntime and frequency to reduce its size by many orders of magnitude before\ncalibration is performed. However, this averaging is detrimental to the\ndetection of some signals of interest that have narrow bandwidth and short time\nduration, such as fast radio bursts (FRBs). Using the proposed algorithm, it is\npossible to calibrate data at such a fine resolution that they cannot be\nentirely loaded into memory, thus preserving such signals. As an additional\ndemonstration, we use the proposed algorithm for training deep neural networks\nand compare the performance against the mainstream first-order optimization\nalgorithms that are used in deep learning.\n","authors":"Sarod Yatawatta|Lukas De Clercq|Hanno Spreeuw|Faruk Diblen","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.05619v2","link_pdf":"http://arxiv.org/pdf/1904.05619v2","link_doi":"","comment":"Draft, final version in IEEE Data Science Workshop 2019 proceedings","journal_ref":"","doi":"","primary_category":"astro-ph.IM","categories":"astro-ph.IM|cs.LG|math.OC"} {"id":"1904.05933v1","submitted":"2019-04-11 19:36:29","updated":"2019-04-11 19:36:29","title":"Constructive expansion for quartic vector fields theories. I. Low\n dimensions","abstract":" This paper is the first of a series aiming at proving rigorously the\nanalyticity and the Borel summability of generic quartic bosonic and fermionic\nvector models (generalizing the O(N) vector model) in diverse dimensions. Both\nnon-relativistic (Schr\\\"odinger) and relativistic (Klein-Gordon and Dirac)\nkinetic terms are considered. The 4-tensor defining the interactions is\nconstant but otherwise arbitrary, up to the symmetries imposed by the\nstatistics of the field. In this paper, we focus on models of low dimensions:\nbosons and fermions for d = 0, 1, and relativistic bosons for d = 2. Moreover,\nwe investigate the large N and massless limits along with quenching for\nfermions in d = 1. 
These results are established using the loop vertex\nexpansion (LVE) and have applications in different fields, including data\nsciences, condensed matter and string field theory. In particular, this\nestablishes the Borel summability of the SYK model both at finite and large N.\n","authors":"Harold Erbin|Vincent Lahoche|Mohamed Tamaazousti","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.05933v1","link_pdf":"http://arxiv.org/pdf/1904.05933v1","link_doi":"","comment":"77 pages","journal_ref":"","doi":"","primary_category":"hep-th","categories":"hep-th|cond-mat.stat-mech|math-ph|math.MP"} {"id":"1904.07989v2","submitted":"2019-04-16 21:16:17","updated":"2019-04-30 19:15:31","title":"COMBIgor: data analysis package for combinatorial materials science","abstract":" Combinatorial experiments involve synthesis of sample libraries with lateral\ncomposition gradients requiring spatially-resolved characterization of\nstructure and properties. Due to maturation of combinatorial methods and their\nsuccessful application in many fields, the modern combinatorial laboratory\nproduces diverse and complex data sets requiring advanced analysis and\nvisualization techniques. In order to utilize these large data sets to uncover\nnew knowledge, the combinatorial scientist must engage in data science. For\ndata science tasks, most laboratories adopt common-purpose data management and\nvisualization software. However, processing and cross-correlating data from\nvarious measurement tools is no small task for such generic programs. Here we\ndescribe COMBIgor, a purpose-built open-source software package written in the\ncommercial Igor Pro environment, designed to offer a systematic approach to\nloading, storing, processing, and visualizing combinatorial data sets. It\nincludes (1) methods for loading and storing data sets from combinatorial\nlibraries, (2) routines for streamlined data processing, and (3) data analysis\nand visualization features to construct figures. Most importantly, COMBIgor is\ndesigned to be easily customized by a laboratory, group, or individual in order\nto integrate additional instruments and data-processing algorithms. Utilizing\nthe capabilities of COMBIgor can significantly reduce the burden of data\nmanagement on the combinatorial scientist.\n","authors":"Kevin R. Talley|Sage R. Bauers|Celeste L. Melamed|Meagan C. Papac|Karen Heinselman|Imran Khan|Dennice M. Roberts|Valerie Jacobson|Allison Mis|Geoff L. Brennecka|John D. Perkins|Andriy Zakutayev","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.07989v2","link_pdf":"http://arxiv.org/pdf/1904.07989v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cond-mat.mtrl-sci","categories":"cond-mat.mtrl-sci|physics.comp-ph"} {"id":"1904.07998v2","submitted":"2019-04-16 22:10:19","updated":"2019-11-11 01:48:59","title":"SynC: A Unified Framework for Generating Synthetic Population with\n Gaussian Copula","abstract":" Synthetic population generation is the process of combining multiple\nsocioeconomic and demographic datasets from different sources and/or\ngranularity levels, and downscaling them to an individual level. Although it is\na fundamental step for many data science tasks, an efficient and standard\nframework is absent. In this study, we propose a multi-stage framework called\nSynC (Synthetic Population via Gaussian Copula) to fill the gap. 
SynC first\nremoves potential outliers in the data and then fits the filtered data with a\nGaussian copula model to correctly capture dependencies and marginal\ndistributions of sampled survey data. Finally, SynC leverages predictive models\nto merge datasets into one and then scales them accordingly to match the\nmarginal constraints. We make four key contributions in this work: 1) propose\na novel framework for generating individual level data from aggregated data\nsources by combining state-of-the-art machine learning and statistical\ntechniques, 2) demonstrate, through two real-world datasets, its value as a\nfeature engineering tool, as well as an alternative to data collection in\nsituations where gathering data is difficult, 3) release an easy-to-use framework\nimplementation for reproducibility, and 4) ensure the methodology is scalable\nat the production level and can easily incorporate new data.\n","authors":"Colin Wan|Zheng Li|Alicia Guo|Yue Zhao","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.07998v2","link_pdf":"http://arxiv.org/pdf/1904.07998v2","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1904.08532v1","submitted":"2019-04-17 23:17:48","updated":"2019-04-17 23:17:48","title":"Stable recovery and the coordinate small-ball behaviour of random\n vectors","abstract":" Recovery procedures in various applications in Data Science are based on\n\\emph{stable point separation}. In its simplest form, stable point separation\nimplies that if $f$ is "far away" from $0$, and one is given a random sample\n$(f(Z_i))_{i=1}^m$ where a proportional number of the sample points may be\ncorrupted by noise, that information is still enough to exhibit that $f$ is far\nfrom $0$.\n Stable point separation is well understood in the context of iid sampling,\nand to explore it for general sampling methods we introduce a new notion---the\n\\emph{coordinate small-ball} of a random vector $X$. Roughly put, this feature\ncaptures the number of "relatively large coordinates" of\n$(|\\langle TX,u_i\\rangle|)_{i=1}^m$, where $T:\\mathbb{R}^n \\to \\mathbb{R}^m$ is an arbitrary\nlinear operator and $(u_i)_{i=1}^m$ is any fixed orthonormal basis of\n$\\mathbb{R}^m$.\n We show that under the bare-minimum assumptions on $X$, and with high\nprobability, many of the values $|\\langle TX,u_i\\rangle|$ are at least of the order\n$\\|T\\|_{S_2}/\\sqrt{m}$. As a result, the "coordinate structure" of $TX$\nexhibits the typical Euclidean norm of $TX$ and does so in a stable way.\n One outcome of our analysis is that random sub-sampled convolutions satisfy\nstable point separation under minimal assumptions on the generating random\nvector---a fact that was known previously only in a highly restrictive setup,\nnamely, for random vectors with iid subgaussian coordinates.\n","authors":"Shahar Mendelson|Grigoris Paouris","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.08532v1","link_pdf":"http://arxiv.org/pdf/1904.08532v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"math.PR","categories":"math.PR|math.ST|stat.TH"} {"id":"1904.08540v1","submitted":"2019-04-17 23:57:19","updated":"2019-04-17 23:57:19","title":"Matrix Completion With Selective Sampling","abstract":" Matrix completion is a classical problem in data science wherein one attempts\nto reconstruct a low-rank matrix while only observing some subset of the\nentries. Previous authors have phrased this problem as a nuclear norm\nminimization problem. 
Almost all previous work assumes no explicit structure of\nthe matrix and uses uniform sampling to decide the observed entries. We suggest\nmethods for selective sampling in the case where we have some knowledge about\nthe structure of the matrix and are allowed to design the observation set.\n","authors":"Christian Parkinson|Kevin Huynh|Deanna Needell","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.08540v1","link_pdf":"http://arxiv.org/pdf/1904.08540v1","link_doi":"","comment":"4 pages, 4 figures","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|stat.ML"} {"id":"1904.09818v1","submitted":"2019-04-18 13:10:59","updated":"2019-04-18 13:10:59","title":"One DSL to Rule Them All: IDE-Assisted Code Generation for Agile Data\n Analysis","abstract":" Data analysis is at the core of scientific studies, a prominent task that\nresearchers and practitioners typically undertake by programming their own set\nof automated scripts. While there is no shortage of tools and languages\navailable for designing data analysis pipelines, users spend substantial effort\nin learning the specifics of such languages/tools and often design solutions\ntoo project specific to be reused in future studies. Furthermore, users need to\nput further effort into making their code scalable, as parallel implementations\nare typically more complex.\n We address these problems by proposing an advanced code recommendation tool\nwhich facilitates developing data science scripts. Users formulate their\nintentions in a human-readable Domain Specific Language (DSL) for dataframe\nmanipulation and analysis. The DSL statements can be converted into executable\nPython code during editing. To avoid the need to learn the DSL and increase\nuser-friendliness, our tool supports code completion in mainstream IDEs and\neditors. Moreover, DSL statements can generate executable code for different\ndata analysis frameworks (currently we support Pandas and PySpark). Overall,\nour approach attempts to accelerate programming of common data analysis tasks\nand to facilitate the conversion of the implementations between frameworks.\n In a preliminary assessment based on a popular data processing tutorial, our\ntool was able to fully cover 9 out of 14 processing steps for Pandas and 10 out\nof 16 for PySpark, while partially covering 4 processing steps for each of the\nframeworks.\n","authors":"Artur Andrzejak|Oliver Wenz|Diego Costa","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.09818v1","link_pdf":"http://arxiv.org/pdf/1904.09818v1","link_doi":"","comment":"7 pages","journal_ref":"","doi":"","primary_category":"cs.SE","categories":"cs.SE|cs.DC|cs.HC|cs.PL"} {"id":"1904.09378v4","submitted":"2019-04-20 00:21:59","updated":"2020-03-09 03:15:26","title":"PersLay: A Neural Network Layer for Persistence Diagrams and New Graph\n Topological Signatures","abstract":" Persistence diagrams, the most common descriptors of Topological Data\nAnalysis, encode topological properties of data and have already proved pivotal\nin many different applications of data science. However, since the (metric)\nspace of persistence diagrams is not Hilbert, they end up being difficult\ninputs for most Machine Learning techniques. To address this concern, several\nvectorization methods have been put forward that embed persistence diagrams\ninto either finite-dimensional Euclidean space or (implicit) infinite\ndimensional Hilbert space with kernels. In this work, we focus on persistence\ndiagrams built on top of graphs. 
Relying on extended persistence theory and the\nso-called heat kernel signature, we show how graphs can be encoded by\n(extended) persistence diagrams in a provably stable way. We then propose a\ngeneral and versatile framework for learning vectorizations of persistence\ndiagrams, which encompasses most of the vectorization techniques used in the\nliterature. We finally showcase the experimental strength of our setup by\nachieving competitive scores on classification tasks on real-life graph\ndatasets.\n","authors":"Mathieu Carrière|Frédéric Chazal|Yuichi Ike|Théo Lacombe|Martin Royer|Yuhei Umeda","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.09378v4","link_pdf":"http://arxiv.org/pdf/1904.09378v4","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.CG|cs.LG|math.AT"} {"id":"1904.09609v1","submitted":"2019-04-21 14:32:42","updated":"2019-04-21 14:32:42","title":"TiK-means: $K$-means clustering for skewed groups","abstract":" The $K$-means algorithm is extended to allow for partitioning of skewed\ngroups. Our algorithm is called TiK-Means and contributes a $K$-means type\nalgorithm that assigns observations to groups while estimating their\nskewness-transformation parameters. The resulting groups and transformation\nreveal general-structured clusters that can be explained by inverting the\nestimated transformation. Further, a modification of the jump statistic chooses\nthe number of groups. Our algorithm is evaluated on simulated and real-life\ndatasets and then applied to a long-standing astronomical dispute regarding the\ndistinct kinds of gamma ray bursts.\n","authors":"Nicholas S. Berry|Ranjan Maitra","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.09609v1","link_pdf":"http://arxiv.org/pdf/1904.09609v1","link_doi":"http://dx.doi.org/10.1002/sam11416","comment":"15 pages, 6 figures, to appear in Statistical Analysis and Data\n Mining - The ASA Data Science Journal","journal_ref":"Statistical Analysis and Data Mining -- The ASA Data Science\n Journal, 2019, volume 12, number 3, pages 223-233","doi":"10.1002/sam11416","primary_category":"stat.ML","categories":"stat.ML|astro-ph.HE|cs.CV|cs.LG|stat.AP|stat.ME"} {"id":"1904.10016v1","submitted":"2019-04-22 18:20:38","updated":"2019-04-22 18:20:38","title":"The Profiling Potential of Computer Vision and the Challenge of\n Computational Empiricism","abstract":" Computer vision and other biometrics data science applications have commenced\na new project of profiling people. Rather than using 'transaction generated\ninformation', these systems measure the 'real world' and produce an assessment\nof the 'world state' - in this case an assessment of some individual trait.\nInstead of using proxies or scores to evaluate people, they increasingly deploy\na logic of revealing the truth about reality and the people within it. While\nthese profiling knowledge claims are sometimes tentative, they increasingly\nsuggest that only through computation can these excesses of reality be captured\nand understood. This article explores the bases of those claims in the systems\nof measurement, representation, and classification deployed in computer vision.\nIt asks if there is something new in this type of knowledge claim, sketches an\naccount of a new form of computational empiricism being operationalised, and\nquestions what kind of human subject is being constructed by these\ntechnological systems and practices. 
Finally, the article explores legal\nmechanisms for contesting the emergence of computational empiricism as the\ndominant knowledge platform for understanding the world and the people within\nit.\n","authors":"Jake Goldenfein","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.10016v1","link_pdf":"http://arxiv.org/pdf/1904.10016v1","link_doi":"http://dx.doi.org/10.1145/3287560.3287568","comment":"","journal_ref":"Proceedings of the 2019 Conference on Fairness, Accountability,\n and Transparency","doi":"10.1145/3287560.3287568","primary_category":"cs.CY","categories":"cs.CY|cs.CV"} {"id":"1904.10416v1","submitted":"2019-04-23 16:45:07","updated":"2019-04-23 16:45:07","title":"Regression-Enhanced Random Forests","abstract":" Random forest (RF) methodology is one of the most popular machine learning\ntechniques for prediction problems. In this article, we discuss some cases\nwhere random forests may suffer and propose a novel generalized RF method,\nnamely regression-enhanced random forests (RERFs), that can improve on RFs by\nborrowing the strength of penalized parametric regression. The algorithm for\nconstructing RERFs and selecting its tuning parameters is described. Both\nsimulation study and real data examples show that RERFs have better predictive\nperformance than RFs in important situations often encountered in practice.\nMoreover, RERFs may incorporate known relationships between the response and\nthe predictors, and may give reliable predictions in extrapolation problems\nwhere predictions are required at points out of the domain of the training\ndataset. Strategies analogous to those described here can be used to improve\nother machine learning methods via combination with penalized parametric\nregression techniques.\n","authors":"Haozhe Zhang|Dan Nettleton|Zhengyuan Zhu","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.10416v1","link_pdf":"http://arxiv.org/pdf/1904.10416v1","link_doi":"","comment":"12 pages, 5 figures","journal_ref":"In JSM Proceedings (2017), Section on Statistical Learning and\n Data Science, Alexandria, VA: American Statistical Association. 636 -- 647","doi":"","primary_category":"stat.ML","categories":"stat.ML|cs.LG|stat.ME"} {"id":"1904.10559v2","submitted":"2019-04-23 22:36:56","updated":"2019-05-07 22:49:47","title":"Neutrino Oscillations in a Quantum Processor","abstract":" Quantum computing technologies promise to revolutionize calculations in many\nareas of physics, chemistry, and data science. Their power is expected to be\nespecially pronounced for problems where direct analogs of a quantum system\nunder study can be encoded coherently within a quantum computer. A first step\ntoward harnessing this power is to express the building blocks of known\nphysical systems within the language of quantum gates and circuits. In this\npaper, we present a quantum calculation of an archetypal quantum system:\nneutrino oscillations. We define gate arrangements that implement the neutral\nlepton mixing operation and neutrino time evolution in two-, three-, and\nfour-flavor systems. We then calculate oscillation probabilities by coherently\npreparing quantum states within the processor, time evolving them unitarily,\nand performing measurements in the flavor basis, with close analogy to the\nphysical processes realized in neutrino oscillation experiments, finding\nexcellent agreement with classical calculations. 
We provide recipes for\nmodeling oscillation in the standard three-flavor paradigm as well as\nbeyond-standard-model scenarios, including systems with sterile neutrinos,\nnon-standard interactions, Lorentz symmetry violation, and anomalous\ndecoherence.\n","authors":"C. A. Argüelles|B. J. P. Jones","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.10559v2","link_pdf":"http://arxiv.org/pdf/1904.10559v2","link_doi":"http://dx.doi.org/10.1103/PhysRevResearch.1.033176","comment":"11 pages, 4 figures","journal_ref":"Phys. Rev. Research 1, 033176 (2019)","doi":"10.1103/PhysRevResearch.1.033176","primary_category":"quant-ph","categories":"quant-ph|hep-ex|hep-ph"} {"id":"1904.11907v1","submitted":"2019-04-26 15:48:56","updated":"2019-04-26 15:48:56","title":"Evaluating the Success of a Data Analysis","abstract":" A fundamental problem in the practice and teaching of data science is how to\nevaluate the quality of a given data analysis, which is different than the\nevaluation of the science or question underlying the data analysis. Previously,\nwe defined a set of principles for describing data analyses that can be used to\ncreate a data analysis and to characterize the variation between data analyses.\nHere, we introduce a metric of quality evaluation that we call the success of a\ndata analysis, which is different than other potential metrics such as\ncompleteness, validity, or honesty. We define a successful data analysis as the\nmatching of principles between the analyst and the audience on which the\nanalysis is developed. In this paper, we propose a statistical model and\ngeneral framework for evaluating the success of a data analysis. We argue that\nthis framework can be used as a guide for practicing data scientists and\nstudents in data science courses for how to build a successful data analysis.\n","authors":"Stephanie C. Hicks|Roger D. Peng","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.11907v1","link_pdf":"http://arxiv.org/pdf/1904.11907v1","link_doi":"","comment":"16 pages","journal_ref":"","doi":"","primary_category":"stat.OT","categories":"stat.OT|stat.AP"} {"id":"1904.12320v1","submitted":"2019-04-28 13:29:49","updated":"2019-04-28 13:29:49","title":"Real numbers, data science and chaos: How to fit any dataset with a\n single parameter","abstract":" We show how any dataset of any modality (time-series, images, sound...) can\nbe approximated by a well-behaved (continuous, differentiable...) scalar\nfunction with a single real-valued parameter. Building upon elementary concepts\nfrom chaos theory, we adopt a pedagogical approach demonstrating how to adjust\nthis parameter in order to achieve arbitrary precision fit to all samples of\nthe data. 
Targeting an audience of data scientists with a taste for the curious\nand unusual, the results presented here expand on previous similar observations\nregarding the expressive power and generalization of machine learning models.\n","authors":"Laurent Boué","affiliations":"","link_abstract":"http://arxiv.org/abs/1904.12320v1","link_pdf":"http://arxiv.org/pdf/1904.12320v1","link_doi":"","comment":"","journal_ref":"","doi":"","primary_category":"cs.LG","categories":"cs.LG|cs.DM|cs.GL|cs.IR|stat.ML"} {"id":"1904.12967v1","submitted":"2019-04-29 21:55:30","updated":"2019-04-29 21:55:30","title":"Astro2020 Science White Paper - Quasar Microlensing: Revolutionizing our\n Understanding of Quasar Structure and Dynamics","abstract":" Microlensing by stars within distant galaxies acting as strong gravitational\nlenses of multiply-imaged quasars provides a unique and direct measurement of\nthe internal structure of the lensed quasar on nano-arcsecond scales. The\nmeasurement relies on the temporal variation of high-magnification caustic\ncrossings which vary on timescales of days to years. Multiwavelength\nobservations provide information from distinct emission regions in the quasar.\nThrough monitoring of these strong gravitational lenses, a full tomographic\nview can emerge with Astronomical-Unit scale resolution. Work to date has\ndemonstrated the potential of this technique in about a dozen systems. In the\n2020s there will be orders of magnitude more systems to work with. Monitoring\nof lens systems for caustic-crossing events to enable triggering of\nmulti-platform, multi-wavelength observations in the 2020s will fulfill the\npotential of quasar microlensing as a unique and comprehensive probe of active\nblack hole structure and dynamics.\n","authors":"Leonidas Moustakas|Matthew O'Dowd|Timo Anguita|Rachel Webster|George Chartas|Matthew Cornachione|Xinyu Dai|Carina Fian|Damien Hutsemekers|Jorge Jimenez-Vicente|Kathleen Labrie|Geraint Lewis|Chelsea Macleod|Evencio Mediavilla|Christopher W Morgan|Veronica Motta|Anna Nierenberg|David Pooley|Karina Rojas|Dominique Sluse|Georgios Vernardos|Joachim Wambsganss|Suk Yee Yong","affiliations":"Jet Propulsion Laboratory, California Institute of Technology|City University of New York and the American Museum of Natural History|Universidad Andres Bello|University of Melbourne|College of Charleston|US Naval Academy|University of Oklahoma|Instituto de Astrofisica de Canarias and Departamento de Astrofisica, Universidad de la Laguna|University of Liege|Univ. 
de Granada|Gemini Observatory|University of Sydney|Center for Astrophysics, Harvard University|Instituto de Astrofisica de Canarias|US Naval Academy|Universidad de Valparaiso|Jet Propulsion Laboratory|Trinity University|Ecole Polytechnique Federale de Lausanne and LSSTC Data Science Fellow|STAR institute, University of Liege|University of Groningen|University of Heidelberg|University of Melbourne","link_abstract":"http://arxiv.org/abs/1904.12967v1","link_pdf":"http://arxiv.org/pdf/1904.12967v1","link_doi":"","comment":"White paper submitted to Astro2020 decadal survey; 7 pages, 3 figures","journal_ref":"","doi":"","primary_category":"astro-ph.GA","categories":"astro-ph.GA|astro-ph.IM"} {"id":"1904.12968v1","submitted":"2019-04-29 21:56:27","updated":"2019-04-29 21:56:27","title":"The Most Powerful Lenses in the Universe: Quasar Microlensing as a Probe\n of the Lensing Galaxy","abstract":" Optical and X-ray observations of strongly gravitationally lensed quasars\n(especially when four separate images of the quasar are produced) determine not\nonly the amount of matter in the lensing galaxy but also how much is in a\nsmooth component and how much is composed of compact masses (e.g., stars,\nstellar remnants, primordial black holes, CDM sub-halos, and planets). Future\noptical surveys will discover hundreds to thousands of quadruply lensed\nquasars, and sensitive X-ray observations will unambiguously determine the\nratio of smooth to clumpy matter at specific locations in the lensing galaxies\nand calibrate the stellar mass fundamental plane, providing a determination of\nthe stellar $M/L$. A modest observing program with a sensitive, sub-arcsecond\nX-ray imager, combined with the planned optical observations, can make those\ndeterminations for a large number (hundreds) of the lensing galaxies, which\nwill span a redshift range of $\\sim$$0.25