Data Gov Transcript
Transcript of lotico session Data Gov - Bringing Government and Scientific Data to the Web
Date: Sep 2, 2010 · 6:00 PM
Gale A. Brewer:
Thank you very much. This is a real honor to be here, and I'm certainly not going to be half as knowledgeable as the speakers who come after me, but I will try to give you a picture of what's going on in city government, particularly from the legislative side. I've been in the city council since 2002, and I've always worked in government, or a little bit in the private sector, but I've always believed that government data should be public, and that's almost an oxymoron because so much of it is not. So years ago I built the first city website from scratch and even put up some little preliminary kiosks in front of city hall. They had to be temporary, because otherwise the landmarks commission would get upset. So you always have to work around things.
But I've always believed in this idea that public information should be public, and when I came into the council I was made head of what at that point was a subcommittee on technology; for the next eight years it was one of the 40 city council committees. There are actually 51 members of the city council, for those of you who don't have the pleasure of spending all your time in city hall, and the budget of the city is now $63 billion. There are about 303,000 employees. It's the fourth largest budget in the United States: the United States budget is bigger, then the state of California, the state of New York, and then New York City. So it's bigger than Massachusetts and Texas and Florida and so on, and bigger than the budgets of half the countries of the world. So technology plays a role. Needless to say, I think government is always slower; you know that, those of you who either work for government or work with government or try to get information. So for the last eight years we've been holding hearings, when really nobody had ever held hearings on technology in government before. We had hearings right after 9/11 about how technology helped, as much as was possible, in that horrific situation and the aftermath, the tracking, how Verizon got back up pretty quickly, and all the things that 9/11 had to deal with. And we've had hearings on, of course, the maps, and on clouds, and on spectrum, and on everything: the franchises with Verizon and the cable companies.
I mean, there's no topic we haven't looked at. On broadband, we passed a bill that said we had to have hearings in all five boroughs to see whether there was or was not accessible low-cost broadband in the five boroughs, and of course there's lots of it, but it's expensive and not very fast. So it's sort of a mishmash of different topics, but the one that is always hiding a little bit, I think, is the data. Does anybody here work for the city of New York? So there's a couple of us. The city of New York does have a very vibrant group, I guess it's a closed meeting, but I go fairly regularly, and it's all of the CIOs in the different agencies who meet on a regular basis. I think it's really amazing, because these are the men and women who have to make things work for this huge city, and they're underrepresented in the newspapers, at least on the positive side. I don't know how many agencies we have, 80, probably more, but the official ones are 80, and this group, I think it's called the municipal data council, comes together with the commissioner of DoITT and other agencies on a regular basis to talk about what their issues are, what their challenges are, how they can work together, how things aren't working, and so on. And there's a big push now to try to make things more centralized, because the agencies, the police department is the worst, but don't tell anybody I said that, are very siloed. They want to do everything in their own little world, and I'm sure that's true in academia, but it's really true with city agencies. So given that whole backdrop, and based on other cities, we're always looking. New York City thinks we know everything, but we do try to see, particularly on the tech front, whether other cities are doing interesting things, and I think on the issue of
open data, that's true. Now, Mayor Bloomberg did, as you know, a couple of times, at least once, and I think there's another one coming up, and some of you may have participated, hold an apps contest. I went to the opening of one, and it was exciting. The only problem for me, and I'm just going to speak for myself, was that it was all based on parking spots and, I don't know, things that are important, but I'm interested in how you can help New Yorkers. Maybe getting a parking spot is part of that. But anyway, the issue was that the data that was available was only the data the city thought was easy, low-hanging fruit, or data that somebody who was doing the apps had requested. Now, it's like anything else: if you don't know the data is there, how are you going to request it? So it was a very small subset of all the data that's available. So we introduced a bill that said that the data has to be available.
We did it in 2009, and we did it again when we started a new session in 2010. We actually had a hearing this past June, and the issue is whether we can pass Intro 29 of 2010. What this legislation says is that the city of New York needs to make its data available in a format that is accessible to the general public, not in some format that nobody can understand, even with the apps. Some of you may have won the contest or participated in it; you had to be one of the people in this room to know how to make it something the public could understand. My background is obviously human services and housing and things like that, so I want something that people working in the nonprofit sector can use to figure out where's the affordable housing, how many homeless do we really have, what are their needs, and so on and so forth. So at this moment, just to give you an update, and I can go through the bill in a minute, we've had a couple of meetings with the new commissioner, Carol Post. She came from the office of operations. She's been there, I think, since December of last year, and she did say in her opening statement to the city council that she wants this bill to pass. Now, of course, you always hear that, and you want to make sure that it's this bill and not something that's much too narrow, in which case it's not this bill. But to her credit, she has had meetings with us, meaning the city council, and she has met with all of the agencies, so maybe 80 or fewer, but all the agencies, and she's giving them a little bit of time, not much, this fall to meet goals or to come up with goals so they can have open data. The idea is that she's got certain criteria that they have to meet, and the agencies have deadlines.
And we're going to meet again, I think it's either the end of September or the beginning of October, to figure out which agencies are meeting their goals and their timelines and which are not. Now, there are some agencies, and you probably know this when you call 311. Has anybody ever called 311? Okay, good. Sometimes it works and sometimes it doesn't. But it is generally amazing; talk about databases, that is an amazing database. Sometimes the agency can quickly respond, and sometimes it's a legacy system and it has to go through a whole bunch of channels before (a) you get an answer and (b) you get data. So she's dealing with some agencies that have some very old hardware that is not compatible with anything that's helpful, and she's trying to figure out how to make this data accessible in the broadest format. And of course we're pushing very, very hard. Secondly, cost saving: anything that saves money in today's world is a good thing. If you've never FOILed, filed a Freedom of Information Law request, you may be the only New Yorkers who've never FOILed, because many, many New Yorkers file. Reporters file, journalists file, upset New Yorkers file. I used to work in a city agency that was always getting FOILed.
And that is a very time-consuming response, because often, to be honest with you, it's given to a low-level intern.
And that intern has to gather all the material that the upset person or journalist is interested in, and supposedly gets back to you in a certain time period, but you can always get an extension. That's just one example, and it's very time-consuming, because you've got 80 city agencies, and often the corporation counsel has to get involved if it's a higher-level request, and of course what is responded to and what's blacked out is obviously important. So the fact of the matter is, I think that if you have a lot of this data up on the web in a format that people can understand, you will not get as many FOILs, and you won't have to be answering all of these endless, and sometimes important, requests. There'll still be FOILs, but it's a cost-saving measure, and I want to add that because those of us who just want the data out there, to be able to work on it and use it practically and to help save lives and make people's lives easier, don't always see it also as a cost-saving measure. So I think that's extremely important. The other thing I want to mention is that there's a lot of interest in this legislation. Obviously, people in the public sector who are writing about what the city is or isn't doing are interested in it. People from groups like Common Cause and Citizens Union, and groups that work on public access, have testified over and over in support. And I think what comes up again and again is what kind of format the data has to be in. That's something that some of you in the room might have some good ideas about, and certainly something that the city is trying to take note of. Having data in an accessible format is definitely part of Intro 29 of 2010. The other issue, I think, is how the city decides, and how we advocate to know, which agencies will produce and which will not. It's always challenging with the police department,
and we're obviously concerned about health records. Something that's just sort of an aside: some of you may have seen Checkbook NYC. It was just put up July 1st by Comptroller John Liu, and Checkbook NYC is an amazing amount of data. Not something that you need to manipulate at all, but it is every penny that the city spends every day, updated daily. It's about $35 billion that's up there now. You can search it: it says that the department of health has spent money on pharmaceuticals and doctors and pediatricians, and that the mayor's office has bought liquor but is going to get reimbursed for a party, and it says how many car services the board of elections has used when the workers go home late, etc., etc. For those of us who are interested in gossip and things like that, it's fabulous. For those of you who are interested in more mundane things like tech, things that are a little more professional, it's also there. What's not there, because they're working on it, is the salaries of the workers, because there's a lot of concern about that kind of privacy; they're worried about whether people would be stalked at home, names, that kind of thing, so it's still not up there, but it will be. I mention it first of all because it's a fun site, you should look at it, but I also mention it because I think the city is finally getting the message that data that's public needs to be public. And one part of that is: where does the city spend its money?
Where does your tax money go? Already, just the other day, somebody did a story about car washing, because it turns out the Daily News must have seen a police car being washed. I don't know how the reporter started on the story, but then he realized that there are like 10 different agencies and it's a hodgepodge of which government car is washed at which car wash. And then unfortunately somebody got a car wash for like $126 or something, a very expensive car wash. So from there it was a story, but he got all of his information from putting the words "car wash" or something like that into Checkbook NYC. So that's good for the public to know, and hopefully we'll have some hearings, and maybe we'll look at the car-wash vendors in terms of value for the dollar. But this particular legislation is larger, in the sense that it is looking at the data sets that you, as academics and people whose job it is to know how, can take and make useful to the public, and that will never get done by government. And so it's absolutely necessary. It's your data; it's your right to have that data. But at the same time, in today's world, what I think is so exciting about it is that it can create jobs, and it can create information, and hopefully make people's lives better. So on so many different levels, this is a very exciting opportunity. How we get to that point is what we're advocating for as much as possible. And then, just finally, there are so many questions that
have to be answered. What's the records policy? How long do you keep the data for? We're already running into some of those questions, even without a large amount available to the public. What are the technical standards? Are they set per agency, or across the board? Is it one data portal? XML, raw data, RSS support? Or just something that's readable? And then, of course, I'm a big believer, particularly in the nonprofit sector, that you'd actually have to have some classes and some training for the nonprofit sector people who could really use this data, so they can use it more effectively. I think that Washington DC's Apps for Democracy, some of you may know it, has actually saved money and is doing the kind of work that we would like to see done in New York. A lot of times the updating is a problem. If you ever go to a government website, you will find, at least in some cases, that it's not updated as often as some of the ones in the private sector, because there's obviously perhaps a bigger motive to get the private sector ones updated. I always tell my staff: don't rely just on the web. You've got to pick up the phone to get the real data, because it's not necessarily going to be correct. We had a big challenge even with the 311 data. The community boards: has anybody ever been to a community board meeting? Oh, four have been to a community board meeting. I've probably been to 5,000 community board meetings. So, just to give you an example of the challenges of elected office: there are 59 community boards in New York City, and for those of you not from New York, they're like little city halls.
There's a staff of about three people, and then there are 50 citizens. Anybody who is interested can apply to the borough president and get on, and if you've been to some meetings and you have an interest in putting some time into the neighborhood, you'll learn a lot. But there's a lot of data there: the street lights go out, the road needs paving, the applications for zoning, the applications for enclosed and unenclosed sidewalk cafés, and I could go on and on. Those often are either dealt with locally or passed on locally, but they're certainly known locally. You even have situations where the movie people come in, and if you're on the west side, everybody hates them even though they're great for the economy. How many of those permits are out at any given moment in Manhattan? Because we always want them to go to the Bronx or Brooklyn. Please go somewhere else. That kind of information would be so helpful to regular New Yorkers: why am I getting all the movies and I can't get a parking space? That's the kind of thing that people actually want to know. But the community boards are not in real time with 311. So you call 311 and the community board doesn't know it. So we're trying to think of ways that community boards, which are run by city employees, can in fact have that real-time data. This is an example of the importance of data sharing, because the community boards would like to know: have 100 people called about the street light, or am I only getting one complaint? Are 100 people complaining about a road that's not paved, or is the only complaint coming from me? Because if there are hundreds of people calling, then we'd like to be more active in trying to solve the problem. There's also something called SCOUT. These people in the city, I've never actually seen them.
They run around in little golf carts and they find problems, but they don't share them in real time with the community boards; they feed the data in centrally. So I'm giving you some examples of this huge data opportunity that's out there and that we'd like to capture with Intro 29 of 2010. So I'm going to stop there. We can certainly talk about it later. I would love it, if people have the time, if you'd either write to the mayor or to Speaker Quinn to say that you'd like to have Intro 29 of 2010 passed in some form, because you think that having government data be public is a good thing. Thank you very much.
Audience member:
I think it's interesting that you characterize the police department as being behind on this sort of stuff, because the public perception is that internally, with block-by-block crime data and predictive analysis, they're actually cutting edge on this.
Gale A. Brewer:
Yeah, I wasn't correct; they're not behind. They won't share it. That's the problem. In other words, CompStat is excellent, and obviously they're doing a great job on terrorism. They now have 1,000 police officers, as we're sitting here, on computers. That could be everything from CompStat to looking at porn and dealing with those crazy people, porn meaning not them, but making sure that others are not doing it, and also looking at the issue of terrorism. A thousand officers, that's a lot of people. So, no, they're very advanced, and they have NYCWiN, which of course is the system that I didn't talk about, a citywide wireless first-responder system, and in the near future they'll have much more connectivity and instant real-time info in the cars. But the only
problem is they don't like to share anything. That's the little problem.
Marco Neumann:
All right. So, with that, we are heading over to Jim. Thank you so much for coming. Thank you. There's so much work to do. So, when is the next meeting? I will let you know. I don't know if it's been set, because we're starting a new session in the fall, but we'll let you know and then you can circle it. So, we're switching the screen. Maybe Evan Sandhaus, you want to say a few words about September 30th, or do you want me to do that? I'll come second.
Evan Sandhaus:
I'm Evan Sandhaus. I'm one of three assistant organizers of the Semantic Web meetup, along with Marco, and we're going to do a pretty cool event. This is the first public announcement of it; the meetup page isn't out there yet. On September 30th we're going to do a joint session with a new meetup called Hacks and Hackers. I don't know if any of you have heard of them or not. They're a new meetup focused on the intersection of journalism and technology: the journalists are the hacks, we're the hackers, so it's called Hacks and Hackers. We're going to do a joint session with them on the role of data and metadata in the collection, production, and management of news. So we're going to have speakers; right now, and this could change, we have the New York Times, the Wall Street Journal, Hearst, and AOL News lined up to speak, and it's going to be a really, really cool event. And I'm pretty sure we have a sponsor that's going to pay for drinks, too. So, if for no other reason than that, I hope you guys feel like this is going to be interesting. The announcement will go up on Meetup shortly, as soon as I can get around to writing the description of the event. So I hope to see you all there; it should be good. [Where is this going to be?] All right, so it's going to be at an AOL facility. I don't know exactly where; it's here in New York. We'll hear more about this.
Marco Neumann:
So again, welcome, Jim.
Jim Hendler:
So I'm going to kind of wander through a few sites and things like that. To duplicate my talk on your own, the only thing you have to remember is data.gov, which is in big letters up there, and that's where I'm going to start. If you look in the corner, most of you can read it up here, where it says "an official website of the United States government." So I'm not showing you my stuff now. Well, I am, but I'm not
showing you primarily my stuff. This is a site run by the GSA, out of the White House, out of OSTP in part, out of several other agencies. One of the first acts taken by Obama was to hire the first-ever US CIO and CTO. The US CIO started this idea that within the government there would be a release of data, sort of like you just heard about from New York. The CTO is involved in an even larger open government effort that goes beyond data to a lot of the documents, the policies, the processes, etc. So there's a huge amount of activity going on trying to figure out how to give people stuff, and particularly the data sets. Now, historically, what happened is that Obama created this on May 21st.
That was 2009; this is 2010 now. So a year ago May, the Obama administration announced the creation of the data.gov website with about 50 data sets on it. So about 50 government data sets were made available, in various formats, but primarily just raw data: raw comma-separated values, some of them XML, things like that. Over the next few months another, I forget the exact number, couple hundred more came on. Now, right about that time, the Obama administration issued what was called the Open Government Directive, which said that every agency in the executive branch of the government, which is the great bulk of the agencies you've heard of, DoD, DOI, bureau of so-and-so, all those guys, had to, basically, it ramps in and it's a long story, so I don't have time to go through the whole thing, but essentially they had to identify some high-value data sets, and there's a definition of high-value, and they had to release those data sets through data.gov. So by roughly the December-January time frame there were about 1,100 or 1,200 data sets on data.gov. Now, meanwhile, over in Britain, data.gov.uk was forming.
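The early releases described above were mostly plain comma-separated files. As a sketch of what consuming one looks like, here is a minimal Python example; the file layout, column names, and numbers below are invented for illustration and are not taken from any actual data.gov release:

```python
import csv
import io

# A tiny sample shaped like an early raw data.gov release.
# The column names and figures here are hypothetical.
sample = """agency,dataset,records,format
EPA,ozone-hourly,120000,CSV
DOI,land-parcels,45000,XML
DOT,bridge-inventory,600000,CSV
"""

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

# Raw CSV only gives you strings; any typing, linking, or
# cross-dataset joining is left entirely to the consumer.
total_records = sum(int(r["records"]) for r in rows)
csv_only = [r["dataset"] for r in rows if r["format"] == "CSV"]
print(total_records, csv_only)
```

The point of the sketch is the limitation it exposes: every consumer re-parses and re-interprets the raw file on their own, which is exactly the gap the RDF work described later addresses.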
You can see data.gov.uk if I do this... oh boy, these speeds are going to make this demo fun. This is all live, by the way; nothing up my sleeves. I should go to my PowerPoint with these speeds. Whoops, see if I touch the side, let's try. That's the advantage of having PowerPoint, but it's not nearly as effective. Okay, data.gov.uk, official site of Her Majesty's government. You may recognize the RDF logo on it down there. So what happened is, let's see, May, June, July, so late summer of '09, Tim Berners-Lee was at a meeting with Gordon Brown, the prime minister of England, and some other relatively well-known people. Bill Clinton was there, I'm told; I wasn't there. And they were kind of going around talking about things Britain could do to get attention back and reassert
themselves as a world leader and all that stuff. And Tim said what he always says to governments: release all your data. Gordon Brown said, "Great idea. Let's do it. Come back and tell me how." Tim said it was the scariest moment of his life; no one had ever said yes before. And part of what motivated it, of course, was that the US was doing it. So there's been what Tim refers to as the friendly rivalry, what I call the war of 2010, and it's kind of cute and fun for those of us who've been helping the governments. In Britain, the data sets were released from the beginning, many of them in RDF. They all must be in a machine-readable format, but they've mandated RDF as the primary machine format: a semantic web from the get-go. Now, over on the US side, we were not mandating anything like that or doing anything along those lines. But what happened is that the laboratory at RPI that the three of us speaking tonight represent, I moved to RPI from the University of Maryland, where I was for about 20 years, in 2007, started what's called the Tetherless World Constellation. I'll say a couple of minutes about that in a second. Deborah McGuinness and Peter joined us over the next couple of years, and Dre Luciano has just joined us. So that's sort of the faculty of this center. We have about 20 to 30 grad students now, depending exactly on what you count as staff and faculty. And what happened was we started looking at these data sets and said, well, why don't we do what the Brits are doing and start turning them into RDF, and show these guys what we can do. So my lab, we didn't have any funding; we just, you know, had been saying for a decade that if people would just release the data, we'd be able to mash it up.
And we thought, well, maybe we'd see if what we were saying was really true, and it was this great revelation to me that all this stuff we've been saying about the semantic web actually works. You know, there are more of you in the room than at the first semantic web meeting I ever held, and that was an international semantic web meeting. I mean, it's really exciting. Anyway, what happened was, for May 21st, 2010, the first anniversary of data.gov, there was a relaunch of the site, and the CIO, Vivek Kundra, was looking for applications.
So he was looking for where some interesting stuff had happened, and there had been a little data.gov newsletter every week, and he started noticing that about half the issues of the newsletter talked about some demo that had been built at RPI. So he asked me to come meet with him, which was exciting; I don't get called to the White House too often. And the upshot is we showed him what we were doing. We had built, over the course of about six months, using primarily undergraduates and graduate students who had never touched the semantic web before they joined the lab, 40-some demos that mash up significant amounts of US government data. We had also built a converter to start converting data sets to RDF, and had created about 6.4 billion triples out of the first thousand or so data sets we had converted. So I got asked to come in somewhat more officially. I'm not an adviser and I'm not a consultant, because those two words have meaning; I'm an expert. So I am the internet web expert for the data.gov project, and what I've been doing is helping them look at some of this. Now, meanwhile, they also wanted a number that would beat the pants off the Brits. So we now have 272,000 data sets available for you on this site. It's worth noting that they come in three forms: raw data, tool catalog, and geo data. About 270,000 of the 272,000 data sets there are geo data. At the moment, they're not very well organized.
Searching for anything is very difficult. So a lot of what the priorities are at data.gov now is how do we make that stuff as much fun and as easy to play with as the first set of data. But let me show you what was going on. So you can find your way to the group at RPI; we're called the Tetherless World Constellation. I won't go into that too much. Now we're leaving the government site, and we're into the site at RPI that Deborah, Peter, and I run. So this is our lab; there are various and sundry things there you can see. What we're working on is sort of two things. One is, of course, a lot of semantic web: all three of us are known for semantic web technology, and we continue that. But we're also looking at spawning out towards some other things, and you'll hear about a couple of them today. What could we do when we start doing a lot of data stuff? What is the whole science of the web? I won't go into that one, or I'll be here for another hour and these guys will get mad at me for not letting them speak. And supporting science. So I like to say that what I've been looking at is very, very broad data integration. What Peter will talk about later is more a small number of data sets whose size makes the entire government data release look tiny, and Deb is sort of somewhere in the middle, in terms of the technologies we need to really do these: the ontologies, the provenance, things like that. So that's sort of tonight's talk; I've sort of segued from halfway through my talk back to the introduction, and now back to my talk. From our website, which you can get to from data.gov, you'll see a pointer to the data-gov project, or you can just remember data-gov.tw.rpi.edu. And we heard what you were asking for, Gale, before you even said you needed it. What we've been doing on this site is building mashups of government data. We started by first just building visualizations, putting the data into RDF, figuring out what we could do.
But the goal of RDF, of course, is the data integration and the data linking and all that kind of data stuff, beyond just looking at one data set through one data portal. So you can see we've currently converted about 687 of the 2,769 data sets that are not geo data. The geo data has its own formats and things like that; we're looking now at how we're going to bring that stuff to the semantic web. We've created about 6.39 billion triples.
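A converter like the one described can, at its simplest, turn each CSV row into RDF triples, one per cell. The sketch below is a toy version of that idea; the namespaces, column names, and values are hypothetical and do not reproduce the RPI converter's actual conventions:

```python
import csv
import io

# Hypothetical input; real converted data sets were far larger.
sample = """sensor_id,ozone_ppb
ABT147,41.2
BEL116,38.7
"""

BASE = "http://example.org/data"   # placeholder namespaces, not
PROP = "http://example.org/prop"   # the RPI converter's real ones

def rows_to_ntriples(text):
    """Emit one N-Triples line per (row, column) cell."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        subj = f"<{BASE}/row_{i}>"          # one subject URI per row
        for col, val in row.items():
            triples.append(f'{subj} <{PROP}/{col}> "{val}" .')
    return triples

nt = rows_to_ntriples(sample)
print("\n".join(nt))
```

Even this naive row-and-column translation is enough to make otherwise isolated tables queryable and linkable together, which is why the triple counts grow so quickly.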
Now, those triples are also available from data.gov. I won't go back there and show you, but you can go down to the semantic web pane at data.gov. So we built a bunch of demos; if you come here you can see them. But being an academic organization, we actually want to do more than just build demos: we want to make it so people can build their own demos. The problem is I can't show you all of our demos tonight, because this thing only has IE, and IE isn't friendly to Google visualizations. There's a big fight, as you know, between Apple and Flash; less known is the fight between Microsoft and Google, and it all depends on sort of what you're going to be able to see on what kind of phone, when. But if you've got Firefox or Safari and you go here, you'll see this stuff; in these demos you'll see partial visualizations. So occasionally I'll click on something and nothing will happen; I don't know which ones are friendly or not. But this is a good example. This is actually our best-known example today, for an interesting reason I'll tell you about in a second. What you're seeing here is ozone sensors. The government released a data set of, for a whole bunch of sensors, the ozone levels being reported by each sensor. The size of the circle is based on the average value, so at a glance you can see this. Now, the interesting thing about this data set is that it didn't say where the sensors were; it just had the sensor ID. So this data set just had the values, and the external key was a sensor ID. Okay. We did a few web searches.
We found there was another EPA site that actually had the locations. So obviously, mashing up the sensor values with the sensor locations let us put them on the map. We also know a lot about the terrain in the world and things like that, because that data is also available elsewhere. So what you have on the side here is a faceted browser over those data sets: here are the ones that are on mountain tops, things like that. And these are the ones that the EPA maintains, these are the ones that the National Park Service maintains, things like that. So we had to get some stuff from an EPA website, some stuff from a Park Service website. If you go into one of these, since it's all web stuff... this connection is terrible. What I'm doing now is running a SPARQL query against an endpoint. This one is actually running on our endpoint, since I launched it from our site. If you go to the Data.gov site and click on "semantic," you'll see about seven of these 43 demos; they're running on their own Virtuoso server, so they're using their own triple store locally for the government. And they also host, as I said, gzipped files of all of the triples that we've converted. Kingsley Idehen, for those of you who are insiders in this stuff and know him, hosts them all live, and also hosts the links between them and the rest of the Linked Open Data cloud. So there are about 13 billion triples of linked data out there; six billion are from this project so far.
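The sensor mashup boils down to a join on the sensor ID, roughly like this; all sensor IDs, coordinates, and agency labels below are invented:

```python
# One dataset has the sensor readings, keyed by sensor ID; a second dataset,
# found on another EPA page, has the sensor locations and maintaining agency.

readings = [
    {"sensor": "ABT147", "ozone_ppb": 41.5},
    {"sensor": "YEL408", "ozone_ppb": 55.2},
    {"sensor": "NOLOC1", "ozone_ppb": 39.0},  # no known location for this one
]

locations = {
    "ABT147": {"lat": 42.6, "lon": -72.4, "agency": "EPA"},
    "YEL408": {"lat": 44.6, "lon": -110.4, "agency": "National Park Service"},
}

# Join on the sensor ID, keeping only sensors we can actually put on a map.
mashup = [
    {**r, **locations[r["sensor"]]}
    for r in readings
    if r["sensor"] in locations
]

# The agency field then doubles as one facet for the faceted browser.
facets = sorted({m["agency"] for m in mashup})
print(mashup)
print(facets)
```

With RDF the join falls out of shared URIs rather than an explicit dictionary lookup, but the logic is the same: an external key in one data set matched against another.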
When we start going after those 270,000 data sets, I don't even want to think how many triples it's going to be. They have very skimpy metadata; we estimated that just converting the metadata would give us another few hundred million triples. Okay, my guess is what we're seeing here is an IE incompatibility, so what I'll do is go back to the IE-friendly version on the real Data.gov site and show you a couple of other demos. So again, if you click on the semantic web tab... oh, let me mention the community for a second. These are the organizations working most closely with Data.gov: the open government initiative, RPI, the Sunlight Foundation; the World Bank has also done a major data release, and one of the things we're working on now is hooking up some of their things. These are the US states that have published data out there; you can see them on my map somewhere, though I guess you can't see states. These are the countries that have done data releases. So far, the US and the UK are the only two where you can actually get it in semantic web formats natively. However, for all the others, the converter we built, which is available through our website, is there for you to take whatever you want and play with. Let me just show you what some of these demos look like. So again, CASTNET was the one I was showing you; let's see if their site will actually work better. What happened is we ported our stuff over to their site, because so many government users have IE only. Where's reload?
It's next to the... So that one's odd. That one is actually a Google error; the problem there, as best we can tell, is coming from the data set. F5, good. Okay. So what happened here is we went off to a different EPA site. So of course now the data can suddenly be linked to other web stuff; it's URIs linking to URIs. You can see it. So the mashup: if you're counting, there's the EPA site, the National Park Service site, and the EPA data set with the primary sensor values. Now we're off to the EPA and NPS sites themselves, so we can link out to them. And the other thing we can do is get down to the raw data readings. Again, we're querying; these are pretty big, and the internet's slow here, but this usually takes about three or four seconds, so I'm not sure what's going on. Again, I don't know exactly what's IE-friendly and what's not, but what you should be seeing here is a nice timeline slide. I'll show you a different one. Okay. So, some of the other demos they've picked up: here's one that's just library books by state. If we click on one now, you can't see the graphs, but it would launch a graph showing some information about it. That was really bad; I'm sorry. I mean, that is not an official policy of the US government. So here, what we have is broadband adoption in urban versus rural areas for the different states; there's a color coding here that you can sort of see on the map. And here's one on global foreign aid by the US, and if you go to our site you can actually see... actually, this one's probably dispatching to our site. Yeah, it is. Um, Peter, remind me to kill D when we get home? I don't know what's happening; someone's playing around with our endpoint, I think. Oh, here we go. All right. So, there's a map here somewhere for this.
This one actually is an IE problem, but okay. So, for example, if we want to see aid to... what's good? India is good. So India has been funded by US aid; this is what's been done by the Department of Agriculture and the Department of State, with their categorizations. Now, one of the things that's really interesting about this stuff is that the visualizations I've been showing you are all just standard APIs supported by Google; some of them are the Exhibit API from MIT; we've also gone out to Yahoo Pipes and Yahoo BOSS.
So lots and lots of people now are building sites where, if you give them an XML page, they will draw it into one of these, as long as you format things right. And to get the formatting right, we've built a bunch of stylesheets; all of those are available. So what I was trying to say before is, if I go back to our primary website, and I'll finish with this, you can see we also have, for example, a bunch of
tutorials. So if you want to learn how to do all the stuff I've just shown you... granted, most of these are written by graduate students, undergraduates, me, you know, a lot of illiterate people, so they're not quite in the format you'd get on a professional site. But how to build these things, how our endpoints work, things like that; this is some of the stuff about the gov data, and these are some of the external things we'll make available to you. If you'd like to see some videos about creating the website: here's an undergraduate from Bennington who came to our lab for three months, a political scientist who had never touched computer science before, and built a whole lot of really cool things: maps with Twitter feeds over them and things like that. So you could see broadband adoption in various US states versus how many tweets were coming from those states, using their geo stuff. I mean, all this stuff is easy; that's the point I'm trying to make. In fact, last week we held the first US government mashathon (they didn't like the term hackathon), where we had 30 people, half civil servants and half support people for them, who came in. We taught them in a day and a half how to build their own demos. I should have loaded it up, and I'd have to do a little bit of navigation to find it, but we now have five or six new demos that were built by government people in a two-day period. They did need a little tutoring, but again, this isn't rocket science anymore. The semantic web stuff has reached a point of maturity; you can get your hands on it, we have a lot of stuff about how to do it, and you can cut and paste. So every demo has... oh, here, you can see it here.
I don't know if this one will actually show up live, because of the IE issue. What we're doing is using some of the New York Times APIs. So here what you have is the budget of an agency, and here what you have is New York Times stories about those agency budgets. Now we're working with the New York Times linked data; this is just a keyword search, and there's now a better way to interact with their stuff. So again, if you want to know why, in this particular time... I don't know what agency this is; the American Battle Monuments Commission, right? What were they doing? Well, this was something about refurbishing the Vietnam Memorial, so you can see where that money was being spent. For each of these demos, what you can see is what technology we used. Here's the actual SPARQL query, right? So if you want to cut and paste that against our endpoint, or against another endpoint, you can. Lots of semantic data there. So again, I invite you to come play. Lots of government data; we're getting better and better at this. Our goal in this project now is to make it so developing one of these government mashups is roughly the same level of complexity as creating a website; we'd like it to be as turnkey as that. It's not necessarily that every end user can do it, but certainly every web master should be able to, and certainly everyone in an organization who has enough literacy to understand what's underlying it. If you're someone who looks at a web page and doesn't know there's some kind of machine-readable format down there presenting it to you, you'll need a little education. But we really think we can get there, rather than you starting from scratch on your own; come play with our stuff.
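The cut-and-paste workflow he describes, taking a demo's SPARQL query and pointing it at an endpoint, looks roughly like this. The prefix, property names, and endpoint address below are placeholders, not the real Data.gov ones.

```python
from urllib.parse import urlencode

# A toy SPARQL query of the cut-and-paste kind the demos expose.
query = """
PREFIX dgp: <http://example.org/datagov/>
SELECT ?agency ?amount
WHERE {
  ?budget dgp:agency ?agency ;
          dgp:amount ?amount .
}
LIMIT 10
"""

# SPARQL endpoints typically accept the query as an HTTP GET parameter,
# so "running it against a different endpoint" is just changing this URL.
endpoint = "http://example.org/sparql"
url = endpoint + "?" + urlencode({"query": query, "format": "json"})
print(url)
```

Fetching `url` (e.g. with `urllib.request.urlopen`) would return the result bindings, which is all the visualization layers need.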
Those of you who have any familiarity with this stuff will be able, in a week or so, to throw these things together. We're also working on a mobile app framework and some other things like that to make it even easier. So again, the goal of this stuff... traditionally in the government, and I'm sure Gale can testify to this, the way you do a demo on data is you go out with a procurement, you hire a contractor, you get the specifications right, they play with it for a year, they build an interface. You can't have their internal formats; they don't provide those back to the government, because that's now a proprietary thing. So the government gives them the data that you and I paid for with our taxes, and the company spends our tax money to make it so the government can't have the data back except through their application. Right? This whole Data.gov project is about breaking that loop. And what I'm hoping you can see is that we're trying to help you do that. Come to our site. Come play. Come find ways to make money off it; the government would like nothing better than for some smart people to figure out how to make a lot of money off of this. We're going to be running some workshops. Evan mentioned Hacks/Hackers; we haven't actually talked to them specifically yet, but just last week I got the green light for the first of the government-to-industry-sector workshops. We'll be having four or five over the next year or two. We'll be working with the media to look at how we can make the metadata from these things more available, so media folks who want to find a government data set can get their hands on it, and how we can make it so the media can add annotations that we'll be able to get back. So, for example, if you go to one of these data sites, it would be nice if it said, you know, here are the stories that have been published using this data, that kind of thing. So, I'm going to stop there, because I'm at my time.
I told him to stop me short and he didn't. So let Deborah start setting up, and I guess I'll take a question while she gets her laptop.
Audience member: Yeah. Okay. So, what ties together the two presentations, I think, is that technically it sounds like we're at the point where, like you're saying, web masters and tool makers can do well. It seems like the challenge, beyond informing the media, is to inform other important stakeholders, like foundation funders and government agency funders who fund nonprofits and other research efforts, to actually make it part of their funding mechanism to create data in a way that...

Jim Hendler: Before this disappears, there is an apps section here where you can see some of the neat things that have been built, only a couple of them by my lab. Most of these are by actual, you know, companies and things; this is the Apps for America stuff. Some of that's in here, but some of these are really interesting, like the fly-on-time things. So it's not only all the people you mentioned. When I say "end user," I don't expect an end user to build a mashup; I expect the end user to be able to look at what one of these mashups does. And just to tell you a quick story: two days before the Data.gov relaunch on May 21st, 2010, they held a press conference, and a woman from Wired came and said, you know, I understand what you're doing and this is really cool, but I'll believe it when it passes the grandmother test. My grandmother is pretty technically literate for a grandmother, but I expect to see some stuff here that, you know, granny will like. Well, this made them very nervous. But about two days after the release, she published a piece; you can find it by searching for something like "grandmother and data.gov." We won, right? And the demo that grandma liked the most was, excuse me, this one.
So this is the White House visitors list turned into mashed-up data, but you can't see all of it here because of some of these issues. For example, running down this side, if you saw this in Firefox or Safari, would be the DBpedia information, the Wikipedia information pulled through DBpedia, the semantic web version, about some of these people. So you would see a picture of Obama. So if we pick Vivek Kundra, okay, so that's Vivek, pulled from a White House site. There's some web search here, news. Here's who visited him; if you click on them, you'll go to their sites. You can see who they visited; we also have some social network stuff. So, grandma apparently looked at this, didn't know who Vivek Kundra was, and started to say, "Yeah, well, that's okay." But then, about three below Vivek, was Peter Orszag, and granny was a news junkie. He's cabinet level, so she was excited. She could see who visited him, noticed his kids visited him, and she's like, "This is cool." So again, the end user should be able to navigate and use these apps as easily as any other apps. But this app should be as easy to build as a website. And again, it's not quite there yet. But for those of you who are sort of semantically savvy and have some background in this, pretty much everything is just off-the-shelf web technologies, PHP, things like that. I'm going to stop to let my colleagues have some time.

All right. Thank you, Jim. And we'll be around, all three of us, after, to do questions; I guess all four of us, to answer questions.
Deborah McGuinness
Okay. My name is Deborah McGuinness, and as Jim said, I'm another constellation chair at the Rensselaer Polytechnic Institute Tetherless World Constellation. I joined RPI at the very end of 2007, after having been at Stanford for nine years running the Knowledge Systems Lab, and prior to that I was not too far away, in New Jersey, at Bell Labs and AT&T Labs, for 18 years after school. What I'm going to talk about today is a couple of emerging trends that come out of the semantic web and linked data. And what I'm going to do first is set a little bit of context. If you've ever seen the layer cake: this is one of the instantiations of what Tim Berners-Lee made famous with the semantic web layer cake, about how you start down here with encoding standards and get up to the point where you have user interfaces and applications that hopefully have trust, and some proof about what's going on. What we're focusing on in this entire session is kind of in here: from the data interchange level through encodings of some meaning; and I'll focus some on proofs and meta-information about where the data came from and why you might believe it. So, we heard Jim give a nice talk, and Gale give some nice motivation for having data on the web. And if you look around at the web and its inclusion of semantics: both Jim and I have been in the semantic web since way before anybody used the phrase "semantic web," but it's clear that there's at least growing acceptance of semantics, some notion of encoding meaning on the web, and we see that trend more and more. So now we've got the web and some inclusion of semantics, plus some notion of social spaces. When I started as a computer scientist, I got to live in a world where I might have been the expert about the data. I built a beautiful application that was carefully constructed; it took a while to do it; I knew my users.
And I sat in a bit of an ivory tower, in a Bell Labs or university setting. But now it's the wild west: the data is all over the place, there are tons of people out there interacting. And that's great, in that there are more contributors of data, but now the people who are going to use that data have a little more of a burden trying to sift through it: is it authoritative? Maybe it's authoritative, but how recent is it? So now we've got the web and the inclusion of semantics, social spaces, and this massive, tremendous potential of linking.
We're creating a lot of new opportunities and a lot of new challenges. So I'm going to go through a couple of examples that motivate some emerging trends. My spoiler, my take-home message, is that there are some next-generation ontology issues emerging, and that knowledge provenance is growing in importance; I'll motivate that and then show you a path towards some solutions. So, one example that I've been leading recently with the National Institutes of Health, with their population science division, is trying to help their researchers ask questions like: do policies like taxation on tobacco, smoking bans, restaurants barring smoking, or workplaces barring smoking impact health, either positively or negatively, and how does that impact health care costs? If you were a researcher trying to ask those questions, what kind of data would you want to see? If I'm the researcher, maybe I want to see one kind of data, or maybe I want a different interface. And if I want to help inform the public... ultimately this organization is trying to change people's behaviors. They're basically trying to get people to stop smoking, when we're working on the smoking issue; we're also working on obesity as a secondary thing, so there we're trying to get people to be at a healthy weight. How do we want to present data so that people might change their behaviors? How do we make that data actionable and believable? And what data might we present so that people actually choose to make the right behavior changes? And then I'm going to show a few pictures. Actually, I'm not going to do it live, because I rely too much on fancy graphics and IE is not adequate to show my graphics, so I'm going to have to show some static things. But what kind of data do you actually want to see? And then, what are the appropriate follow-up questions?
So this is a very new demo, where we're exploring right now with the National Institutes of Health what kind of data they want to see. One data set that we have has a lot of information about bans, and this is the policy within workplaces. The particular year this is showing is 2007. You can see that some states, like California, which you would actually have expected to be better, don't have complete coverage of policies in workplaces, versus some other states, like New York, which actually does. But you can see in California when the policies started to come into place: they came in a little bit of a way starting in 1990, and actually California led the way here, although you wouldn't know that unless you'd looked at the data; then in 1998 they came in in a much bigger way. And then we also want to look at whether cost actually makes a difference, and whether taxation makes a difference. We might want to look at these trends and say, "Wow, that's a pretty steep increase around 1995"; maybe you want to go and look and see whether something was going on. And ultimately, one of the data sets that we have includes prevalence: how many people smoke. What you want to see is smoking prevalence going down, and California is actually a pretty good state for that. And this data point in here is kind of interesting, so you might go and ask: what was going on in 1995, or what preceded this drop, and then what went on between '95 and '96 that it jumped back up again? So we want to help them look at the data and ask the right questions about what's going on. They actually put out a number of surveys to try to collect data.
So we might also want to help them figure out what kind of data they should be collecting, and where the holes are, because one of the things that we at RPI, but really anybody with access to the web, are great at these days is finding that the data is out there; there's a lot of data out there. In this data we see that we don't have anything for 2008 and 2009; maybe we want to go get that. So maybe we want to identify gaps, and maybe we want to identify these possibly interesting portions of graphs, and maybe we want to look at questions like... when I first looked at this (a student put it together), I said, wow, taxation on cigarettes is actually going down. That seems weird. But then I read the fine print, and it's adjusted for inflation. So when you actually look at the data, it's not going down.
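The inflation point is easy to see with a toy calculation: a flat nominal tax shrinks in real terms once you deflate it by a price index. The CPI figures below are made up to keep the example self-contained.

```python
# Nominal cigarette tax (dollars per pack) held flat across years,
# and an invented consumer price index with 1995 as the base year.
nominal_tax = {1995: 0.37, 2000: 0.37, 2005: 0.37}
cpi = {1995: 100.0, 2000: 112.0, 2005: 126.0}

def real_tax(year, base_year=1995):
    """Deflate a nominal tax back to base-year dollars."""
    return nominal_tax[year] * cpi[base_year] / cpi[year]

series = {year: round(real_tax(year), 3) for year in nominal_tax}
print(series)  # the real value falls even though the nominal tax is flat
```

This is exactly the "fine print" trap: a chart of the deflated series slopes downward even when the legislated tax never changed.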
So, it helps you see patterns. This is the one that's much cooler, and I can't show it in IE: these motion charts. When we were working with them, they said, "Well, let's look at a couple of states that are representative of what we at NIH consider good policies," i.e., more bans on smoking, more taxation on cigarettes, and lower prevalence of smoking. Some of those states were like California, and a few that are off of this. And then there are a couple of states, like Alabama and Arkansas, that represent light policy, light taxation, and high prevalence of smoking. And then let's actually look and see what is happening. I can't show this live because of the browser problems, but I could if I had Firefox or Safari. You can play this and see it over time, and you can stop it at any particular time, click over any of these circles, and see how many people were smoking, what the taxation rates were, what the tax was, and what the smoking prevalence was. And if you ran that and let it go to completion, you would see some states take the lead. California was one of the leading states; if this were in action, you'd see them coming out here with their ban coverage, because they started everything. Then you see a lot of states start doing taxation on cigarettes, and you see some states, like New Jersey, with very significant ban coverage and very significant taxation rates. And what you can't see, because it's static now, but what you really want to see, is the prevalence changing from a high prevalence, indicated here in blue, to a low prevalence, indicated here in yellow; you can actually see some of these circles change. Ultimately, if you're at the NIH, and probably for many of us, you would like to see fewer people smoke. So what we're trying to do is help people explore how we might get that behavior.
And then we let us and them collaboratively look to see what might be causal and what might not be, and look at how things have changed over time. So the ban coverage zoomed up in 1998 in this particular state, California, and smoking prevalence went down, but it didn't go drastically down. But you can also see the taxation rates go up as well. So it lets them ask questions. Some of the questions are: what kind of data do we actually want to display? I started with: we want to make people healthier. So we probably want to see a reduction in lung cancer; we might want to see a reduction in health care costs. That data is really messy, so we're not actually displaying it. But we do have decently clean data for prevalence, for how many people are self-reporting whether they smoke, because that's actually the largest data set we have. So you want to go in and ask: well, what's the definition of prevalence, how are we measuring prevalence, is our data set the best data set for measuring prevalence, and under what conditions did we get this data? Is it recent, how big is the sample set, are there extenuating circumstances that may or may not impact my use of the data, and then do we need more data, and do we want to make more inferences? So these are just some of the drill-down questions that we want to help them ask. And then, from a technology perspective, how can we help? Jim showed us some really nice demos and gave some nice perspective on the fact that we can actually build these demos pretty fast. What we really want to do is not only build fast demos, and not only make it so that we can build the demos, but so that this audience can build the demos and your customers can build the demos, and also make it so that those demos are understandable and actionable, so that they're potentially behavior-changing.
So there are some things that I, as a technologist, feel I bring to bear on this. One thing is I can help people understand what the terms mean: both in my communication of what prevalence is, for example, but also, in trying to put data together, in saying that if this person is using this term one way and that person is using the same term in a different way, maybe I don't want to connect them together. Also, one enormous topic is provenance: where did that information actually come from, and when should I rely on it? How recent is it? And another thing that didn't come up in Jim's talk, but comes up much more when you look at the longevity of these demos, is: how do we handle the fact that the data changes? NIH has given us, I think, five different dumps of the data. Do we just throw out the old data? Well, some of it was wrong, and they're authoritative about that; for example, they said they had 101% coverage. That seems like it's just wrong, so I should probably throw it out. But in other cases, they said, well, this interpretation is the one we believe now, and we had this other interpretation in the past. Maybe I don't want to throw out those previous interpretations, because maybe we actually want to go back and look at them in the future. So, how do we handle changing demos, changing data, different presentations for different audiences, et cetera? So... oops, I'm just looking at my time. Okay. My two emerging trends are: one, get some encoding of the meaning; and two, keep some encoding of where the information came from. Marco mentioned that many people know me as Ms. Ontology. When I went to Stanford and took over the Knowledge Systems Lab, I said, you know, I run Ontologies 'R' Us: we build ontologies, we maintain ontologies, we disseminate them. And what is an ontology?
I was on a panel in 1999 where four ontology experts came together and argued for about four hours, over a lot of alcohol the night before, about what we were going to agree on as the definition of ontology. And we essentially came up with a spectrum.
Other people have generated ontology spectrums as well; this is one of the simpler ones. Essentially, you're capturing meaning anywhere from a simple level, like a catalog entry, just a text string, up to a much more formal, very specific definition of meaning, like we might encode in first-order logic or a higher-order logic. Right now, what we're seeing on the web is relatively inexpressive: simple-ish encodings of meaning like controlled vocabularies, usually in a single language, often English; then just a small amount of information saying that, say, this shirt is a kind of clothing, a simple taxonomy; then, the shirt is made of particular kinds of materials, so I might have a property "made from"; and I might go on to more and more complex descriptions of how terms relate to each other. When we go into the science domain, which Peter is going to talk about more in the next talk, we'll see more expressive encodings of meaning. But all along the spectrum you can get tremendous value. So one of the themes is that ontologies and formal encodings of meaning that are computationally understandable are gaining traction. And another theme is that capturing some encoding, even if it's just in natural language, of where the information came from is incredibly important. A definition that we like to use for provenance is: the origin or source from which something comes; the intention for use; who or what generated it, or who or what the material was generated for; the manner of manufacture; the history of subsequent owners; and the sense of place and time of manufacture, production, or discovery, documented in detail sufficient to allow reproducibility.
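The low end of that spectrum, a controlled vocabulary with a simple taxonomy and one property, can be sketched in a few lines. This just codes up the shirt example above; the class names and the `made_from` property are invented for illustration.

```python
# A toy taxonomy: each term points to its parent class.
parent = {
    "shirt": "clothing",
    "clothing": "product",
}

# One property relating terms to each other.
made_from = {
    "shirt": ["cotton", "polyester"],
}

def is_a(term, ancestor):
    """Walk up the taxonomy: does `term` fall under `ancestor`?"""
    while term is not None:
        if term == ancestor:
            return True
        term = parent.get(term)
    return False

print(is_a("shirt", "product"))  # shirt -> clothing -> product
print(made_from["shirt"])
```

Even this tiny amount of structure buys you something (e.g. querying for all products finds shirts); the more expressive end of the spectrum replaces these dictionaries with logical axioms and a reasoner.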
So that's just provenance alone. Then pick your favorite definition of knowledge; I picked a few to put up here, but basically it's some kind of belief, fact, or condition of being aware of something. Then put them together: where did the knowledge actually come from? And now we've got knowledge provenance. So why do you care about knowledge provenance? Hopefully you got some sense of why the NIH
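As a rough sketch, the provenance fields from the definition above could be captured as a plain record like the one below. A real system would use a shared vocabulary (e.g. the W3C provenance work) rather than an ad hoc class, and all the field values here are invented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Provenance:
    origin: str                 # origin or source from which something comes
    intended_use: str           # who or what the material was generated for
    manner_of_manufacture: str  # how it was produced
    owners: List[str] = field(default_factory=list)  # history of subsequent owners
    place_and_time: str = ""    # documented sufficiently for reproducibility

# A hypothetical record attached to one of the NIH data dumps.
p = Provenance(
    origin="NIH population science survey, dump #3",
    intended_use="smoking prevalence analysis",
    manner_of_manufacture="self-reported telephone survey",
    owners=["NIH", "RPI Tetherless World Constellation"],
    place_and_time="US, 2007",
)
print(p.origin)
```

The point is that every derived demo can carry a record like this along, so a user can ask "where did this number come from, and how recent is it?" without leaving the application.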
researchers might care: how recent is this data, how reliable is it, when are they going to depend on it? I've spent a career trying to make knowledge representation useful, and so I've done a lot of sponsored research projects and interviewed tons of users, who essentially always say the same thing: if you don't tell me why I should believe this, then one, I'm not going to fund you, and two, I'm not going to take action on the answers your system is giving me. I presume we're going to make these viewgraphs available; I just put in a couple of articles that you can go look at, where intelligence analysts say they're not going to listen to intelligence programs unless they can understand where the data came from. Intelligent assistant users aren't going to take the recommendations. Virtual observatory users, and Peter might have a few viewgraphs from a joint effort that we did on virtual observatories, want to know where the data came from. These are all fairly well documented, with user studies showing that people are not going to accept it without that information. And there's tremendous growth in this area. The World Wide Web Consortium, which is the first standards body I look at for web technology standards, has a Provenance Incubator Group; I'm on that group as well. There are some really nice documents that I included here so you'd have a place to go look if you care more about provenance. And there are quotes from all sorts of wonderful famous people, like our friend Tim Berners-Lee, who is famous for talking about the "Oh, yeah?" button: if I show you something, you want to be able to say "Oh yeah? Why?" So basically, that's what this whole line of research is about: at any time that you see something, you can ask, why should I believe that, or where did it come from?
So the position that we take is that system transparency, being able to have that information about where something came from, why you might believe it, and why you might disbelieve it, supports understanding and trust, and allows you to look at a system and know when to take action on it. So one line of research that we're pursuing is to make interoperable infrastructure that supports explanations of everything: sources, assumptions, answers, etc. And we have built a lot of infrastructure that we've used in a wide variety of applications. From scientific virtual observatories: if you get this picture, what did that picture rely on, where was the data set coming from? If you had amazing eyesight, you would be able to see in here some metadata about the weather conditions under which that data was collected. To intelligent assistants that you can ask, what are you doing and why? It says it's waiting for approval; it says what it's doing right now and what rule it's following. To one of my favorites, because I'm a big wine and food person: something that recommends wines with meals, and why you might take that recommendation. Here it's recommending a Chardonnay (actually I don't see the question that it's being asked to recommend for), but it shows why you might take that recommendation and what information it relied on. To intelligence settings, with the data that it was actually coming from; actually, New York Times people might like this, because you can see the portion of the document that it's actually relying on, and then there's a kind of fancy reasoning system in the back end that's saying what it took out of that and why. To taking recommendations on how to deconflict two airline paths that somebody had identified as likely to conflict; this is a solution about how to resolve the conflict. So it's a wide range of applications. Oh, and to proof combination.
So you might not want to see just one person or one system that believes a particular answer. This is showing a lot of different systems that believe the same answer, and it also will go and show the systems that believe the opposite of that answer. So the thing this all has in common is that it's trying to let people look in at what's behind the scenes: before they decide they're going to buy that Chardonnay, or before they beat up on the agent for why it's waiting, before they think it's broken, so they can find out what it thinks it's doing, or before you take this data set and start using it in your experiments, when you might want to know more about it. Okay, I've just got three minutes, so I'm going to go through this. These days, when we're doing this from a data set, coming up with a demo using data from a particular agency, what's really going on is we grab that data, we transform it, we revise it, we probably archive it, we make some deductions based on it, we derive these pretty demos, and then what we want to do is make it so that people can see when they want to use it and when they want to combine it. I'm going to skip through one of my application areas; there are backend encodings that help me also look at error conditions, and Jim looked at this demo in the previous talk. So one of these applications was looking at foreign aid, and there were some questions about this demo. One of the things that you might want to do when you look at the demo is say, I don't really believe that data. I'm not an expert on that data, but that seems to go against the things that I would have believed.
So I might want to annotate it, hoping that an expert will then go in, look at it, and see whether it's right or wrong. So now what we're doing is going in and annotating a number of these applications with information about where the data came from, and allowing people to do things like ask questions. Or, and actually this demo isn't annotated this way but one of the other ones is, mash it up against a New York Times data set, or any particular data set that you have access to, and ask: what was going on in this time period, what was being reported in the New York Times, in this time period where I think something interesting happened? So, back to my original slides for the NIH, where the taxation was zooming up: what was going on then? I finally talked to a tobacco researcher, who said, "Well, that was the time that 46 states settled with the US government for having to support additional health care costs, and all of a sudden taxation went up 45 cents a pack." You know, I didn't smoke, I didn't pay attention to that, but that was an interesting event. I don't know whether we've got data that far back from the New York Times, but we have some of the New York Times data set, so we could see whether that would actually come out as a hypothesis for letting people look at what's really going on there.
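As a sketch of the time-period mash-up described above, here is a minimal, hedged example. The tax figures and headlines are invented, and the spike threshold is an arbitrary illustration, not the project's actual method: the idea is just to find jumps in a per-year series and line them up with dated news items.

```python
# Illustrative sketch (invented numbers): match spikes in a yearly data
# series to dated news items, as in the cigarette-tax anecdote above.
tax_per_pack = {1997: 0.24, 1998: 0.24, 1999: 0.69, 2000: 0.72}  # hypothetical
news = [
    (1998, "46 states reach tobacco settlement"),   # illustrative headline
    (2000, "New state excise tax takes effect"),
]

def spikes(series, threshold=0.10):
    """Years where the value jumped by more than `threshold` over the prior year."""
    years = sorted(series)
    return [y for prev, y in zip(years, years[1:])
            if series[y] - series[prev] > threshold]

def explain(series, events):
    """Pair each spike year with any news item from that year or the one before."""
    return {y: [headline for (ey, headline) in events if ey in (y - 1, y)]
            for y in spikes(series)}

print(explain(tax_per_pack, news))
```

With these made-up numbers, the 1999 jump pairs with the 1998 settlement headline, which is exactly the kind of hypothesis the speaker wants surfaced automatically.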
So, a few of the points that I was making today: one, we believe knowledge provenance is critical for user acceptance in many settings. We love these demos, we love this technology, but if you don't know where the data came from and how reliable it is, how are you going to know when to accept it? And, a point that I didn't make quite as deeply here: we've got a lot of data supporting the claim that having some encoding of meaning helps you interoperate with the data and helps you make the connections. And there's a reasonable amount of technology for supporting knowledge provenance in ontology environments, although it is a growing area of research and development. Ultimately, the open data initiative and semantic web technology are changing the way I live. I think they're changing the way all of you live, and I think they're going to change our futures and create a lot of new opportunities. And when we put in this kind of technology, it creates even more opportunities for change and evolution.
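The grab-transform-revise-archive-deduce-derive pipeline described a moment ago can be sketched as a tiny provenance log. This is an illustrative toy, not the group's actual encoding, and every identifier below is invented; it just shows how recording each step lets you walk a demo back to its raw sources.

```python
# Hypothetical sketch: record each pipeline step, then trace any derived
# artifact back to the raw sources it depends on. All names are invented.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    activity: str      # e.g. "transform", "derive"
    agent: str         # who or what performed it
    inputs: List[str]  # identifiers of source artifacts
    output: str        # identifier of the produced artifact

@dataclass
class ProvenanceLog:
    steps: List[Step] = field(default_factory=list)

    def record(self, activity, agent, inputs, output):
        self.steps.append(Step(activity, agent, inputs, output))
        return output

    def lineage(self, artifact):
        """Walk backwards to every raw source the artifact depends on."""
        sources, frontier = set(), [artifact]
        while frontier:
            current = frontier.pop()
            producers = [s for s in self.steps if s.output == current]
            if not producers:
                sources.add(current)   # nothing produced it: a raw input
            for step in producers:
                frontier.extend(step.inputs)
        return sources

log = ProvenanceLog()
raw = "agency:smoking-survey.csv"                                  # invented id
clean = log.record("transform", "etl-script", [raw], "local:clean.csv")
demo = log.record("derive", "visualizer", [clean], "demo:tax-chart")
print(sorted(log.lineage(demo)))   # every raw source behind the demo
```

The payoff is the `lineage` query: when someone asks "why should I believe that chart?", the system can answer with the raw data sets and the steps that produced it.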
So with that, I'll take questions.
Okay, so I have a question that kind of follows the theme from the floor: a concrete example can be misleading when you're trying to draw a very simple connection between taxation and federal policy.
Right. So this is actually a fantastic topic that we could talk about for hours and hours. Actually, my work with the NIH statisticians is, I think, kind of different from my work on the rest of the data.gov applications, because I work so directly with these statisticians. They've spent the last decade analyzing this data, and they're all terrified that we're going to get that data out there and then mislead people, or that people aren't really going to understand it. And actually, fancy detailed analysis has gone into this. Plus, the data that we're working on isn't perfect; it's survey data on whether you're willing to write down that you actually smoked. So there are assumptions, there are weaknesses, there's incompleteness, and how do we display that in a way that's simple enough for people to understand, yet appropriate and faithful to the methods and the data? The jury's out, basically, but it's imperative. Actually, I've spent a lot of time in hospitals and nursing homes and medical informatics settings lately, sadly for family health reasons. But one of the things that came out for me in this last year of that experience was this: if we don't remake our education system on health (I'm just looking at health at the moment), and our medical informatics system, and our environment, we're all in so much trouble. We have no choice; we've got to do this. Plus, I didn't make this point in these viewgraphs, but it's an instrumented environment that we live in now, and we're just going to get more and more instrumented. This data is coming at us at a massive rate.
We've got to find ways to make sense of it in ways that might make sense to you as a PhD researcher, or to your 10-year-old who's trying to decide whether to eat that donut or not, and at all levels in between. So I think these technologies can help, and one of the ways they can help is that they can hide a lot of the detail in particular contexts and then let you drill down when you need more information. So I think making available the context that the information was gathered in, and the deductions that were done, is critical.

Audience: I'm interested in the rights of the research and how those travel through people's use and application. Is there something in common for how this open data can be followed? And not just open data, but individuals and institutions creating their own data sets.

It's an unbelievably important topic, critical to our future, and there's no single answer. But we've got to find ways to encode it, keep track of it, pass it on, make sure it's not stripped out, make sure that if you're the author you get credit, make sure that the rights are preserved, and also have ways of protecting the information.
Right. One interesting aspect, well, not just an application, is that it can not only be followed; if it's an effective right, really good. And actually, I don't know whether we're going to have time for the panel afterwards, but if we do, this is a great topic to follow the presentations.

Audience: Where can we go from here and learn more about this stuff? Do you offer training courses at RPI where people can learn more in detail?

Yes, Peter will talk about this. We will, actually. Awesome. Okay. Thank you.
Peter Fox:
All right. I'm Peter Fox. That's me right there, in the deep south of Tasmania two years ago. Not from Tasmania, of course. And you should be worried, because any time the word revolution appears in the title of a talk, you can be sure that the fire hose is about to be turned on. Now, I'm presuming there's not a strong scientific literacy in this audience, right? Anyone from a science background? A few. Okay, terrific. So seriously, what is the fuss? Well, the fuss is that we have a complex Earth system. It's changing very rapidly, more rapidly than has ever been seen before. It's coupled, the sources of data and information are just growing exponentially, and you should all know that, period, bottom line. Technical organizations, for example Microsoft and Microsoft External Research, are an example of the many organizations paying amazing attention to this problem, resulting in books like The Fourth Paradigm, which you can go and search for on the web under, did I say Microsoft? Yeah, Microsoft Research. And the reason is that science is evolving. This is a slide due to Jim Gray from Microsoft and Alex Szalay, an astronomer from Johns Hopkins, and it basically says science has gone through four phases: an empirical phase, a theoretical phase, and then only 20 years ago did high performance computing come along and start to revolutionize how we do science. But only in the last five years has data exploration, what we normally call eScience or data science, really come to the forefront, and it's a synthesis of things that have come before. There are enhanced data management requirements. There's data mining, pushing the need for new algorithms, which has really come to the forefront. So it's really feeding into a lot of areas of science: mathematics, physics, and computer science. But as a scientist, our working premise is this one. This is a sort of mantra that Deborah and I put together a few years ago.
Anyone should be able to access this global distributed knowledge base and have it appear to be integrated and locally available. And if it's not, it's actually going to be very, very difficult to use, because scientific data, if you think that government data is complex, well, scientists like to do a good job of obscuring their data. We obtain the data and information by multiple means: different protocols, different vocabularies, stated or unstated assumptions, metadata, who knows? And the red text is what I've been saying for about 20 years now: all data really is created in a form that facilitates its generation and not its use. That's why we're all in business. Actually, if data was created to facilitate use, we wouldn't be here. People would just use it.
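To make the "created for generation, not use" point concrete, here is a minimal, hypothetical mediation sketch: two imaginary sources report the same quantity in their own generation-friendly shapes (their own field names, units, encodings), and a small mapping layer presents one use-friendly form. All field names, station ids, and values are invented.

```python
# Hypothetical mediation sketch: map two source-shaped records into one
# common, use-friendly schema. Every name and value here is invented.
def from_source_a(rec):
    """Source A reports temperature in Fahrenheit under terse keys."""
    return {"site": rec["stn"], "temp_c": (rec["tF"] - 32) * 5 / 9}

def from_source_b(rec):
    """Source B reports tenths of a degree Celsius under verbose keys."""
    return {"site": rec["station_id"], "temp_c": rec["t_decidegC"] / 10}

mediated = (
    [from_source_a(r) for r in [{"stn": "NY01", "tF": 68.0}]] +
    [from_source_b(r) for r in [{"station_id": "NY02", "t_decidegC": 195}]]
)
for row in mediated:
    print(row["site"], round(row["temp_c"], 1))
```

The semantic mediation described later in the talk does this kind of mapping with ontologies rather than hand-written converters, but the underlying problem is the same: each producer's format serves its own pipeline, and use requires translation.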
Underlying that, and you've heard a little of it before, is semantic heterogeneity, large-scale data, complex data types, legacy systems, and so on. But worse than that, there's a bigger problem, and that is our two primary means of conducting science, as illustrated here. The left shows the deductive reasoning approach, where you start with a theory, come up with a hypothesis, compare it to some observations, and see if you can confirm it. So this is the typical model-theoretical approach. On the right is the one that's not only facilitated by data: you start from the bottom with observations, you look for patterns, you come up with a tentative hypothesis, and you try to arrive at a theory. And the way we've built all our information systems, scientific and non-scientific, funnels people directly along those two routes. So what about abduction? And no, not the criminal meaning; it is a methodological inference. That's the key semantics there. Abduction, introduced by Charles Sanders Peirce, comes prior to either of these. It is to say: I'm not quite sure what I'm looking for, I've sort of got an idea, and I'm moving back and forth between the inductive and deductive paradigms.
And so it's abductive reasoning. So again: semantics, bringing in intuition, bringing in relationships. This is a thing that's largely been pushed completely out of data and information systems, especially for science, and semantics is actually bringing it back. To give you a couple of examples, I'm just going to give you two slides on two paradigms of the way in which science is being conducted. One of them is this concept of virtual observatories. At the top here you have two real measurements, the black diamond and the gray diamond. And what you really want to know is what happened where the blue diamond is, where you didn't actually take a measurement. So you virtually want to have a measurement. Now, it could be done by interpolation, by extrapolation from a model, from a climatology, by a whole variety of methods, but it's virtual data. From a remote sensing point of view, you might want to take images at three different wavelengths, perhaps at slightly different times, and have an integrated measurement, which again tells you something more, something value-added: the types of data integration that scientists need. But both of these usage patterns exacerbate our data management problems and challenges, because now we're managing virtual data sets as well. So this is one of the big problems, and this is the revolution that's going on: we didn't know how to manage regular data sets, and now we've got virtual data sets. So about eight years ago, Deborah and I started on a project to unlock these virtual observatories. This is a little animation of a virtual observatory slide to show you how we've implemented semantics in a production system. It's been in production since 2007. It's had evaluations, it's won best papers, all sorts of things. So here are all the different data sources, and up the top are things like virtual observatory portals, web services, and APIs.
And in the middle is this semantic mediation layer, which includes the ontologies capturing all the important scientific concepts, parameters, and instruments, but also descriptions of the data and the service classes as well. So there's a full ontology behind this. It maps the queries to the underlying data and mediates across all these repositories. But the most important thing really is that it introduces this smart interaction, so that you can start offering up alternate hypotheses and make suggestions in an open-world setting, rather than assuming you know everything and funneling people down the inductive or deductive route. You push things up in front of them, so it allows this extra interaction. It gives semantic interoperability, well-defined meaning, between each of these modes of accessing the data, and it allows you, through higher levels of mediation, higher levels of ontologies mapping to non-specialist vocabularies, educational vocabularies, and inventory vocabularies, to go right up the stack. And the business case here is: where does semantics add value? Well, it adds value in unlocking these data resources. It adds value in interoperating with web infrastructure between and among the catalogs, and it gives you the ability to push it into other domains. The second piece, which is also known as a VO, is a virtual organization, and this is the way a lot of scientific collaborations are being conducted these days. You probably participate in these. It's a group of individuals who may be
geographically dispersed. They may come from different institutions, but they use the internet, largely, and tools to come together under a common goal. That common goal might be short-term, but it's often a long-term common interest, and the glue is information technology. Most importantly, if you're in a virtual organization, this is your primary takeaway: when you work in this context, the role, responsibility, and status you have in that virtual organization may be completely different from at your home institution, and not recognizing that can cause significant problems. This is important for scientists as they start to deal with these larger amounts of data in a completely different mode. So a virtual organization has coupled elements of technology, some organizational structure, and communication. So to facilitate massive collaboration, which is the subtitle of my talk, you have to take this into account. And because they have a high degree of informal communication, lots of little emails, you have to basically instrument the infrastructure by which they communicate in the systems that you design, and attach that to the data in exactly the same way that we've been hearing about for provenance. So, here's a method that we have developed over about six years now. It's a modern informatics approach. It takes advantage of the scale-free nature of the internet and semantic networks.
Scale-free meaning that the infrastructure you build will work for small groups on small problems and large groups on large problems, and in between; it's a log-log distribution. And it fundamentally involves use cases, all the stakeholders in the virtual organization, distributed authority, full recognition of access control, ontologies to mediate things, and maintenance of identity. And this particular circular and iterative development cycle that we have, which is modeled largely on the software development cycle, starts with the use case and proceeds around through expert review. Most importantly, it only adopts technology late; it does several iterations before it even goes and looks at technology. So it's not technology bound, it's meaning bound. And we rapidly prototype, open world, iterate, redesign, redeploy. We teach this to graduate students in a 13-week course. Deborah and I developed the course; this year Joanne and Deborah are teaching it, and it's called Semantic eScience.
Let me just give you an example of a use case, and they're not trivial use cases. One of them is in marine habitat change. This is an image taken by a thing called a habitat camera, courtesy of Scott Gallagher at the Woods Hole Oceanographic Institution. This is interesting for many reasons. These cameras are towed behind commercial fishing vessels off the Atlantic seaboard. They obtain about a terabyte per run, and they're analyzed for scallop shells: size, shape, color, place, number and density, fragments. But other people want to use this data as well. So what's this?
Sometimes the dirt and mud, what's considered noise to the scallop counters, is another person's signal. They want to understand the sediments and the rocks that are there. And the use case might be: what's the temperature and salinity of the water, and is this marine specimen meant to be here? Is it flora? Is it fauna? And is it part of an ecosystem change? These are the things that we're starting to answer today, facilitating things like integrated ecosystem assessments, real applications. So, it's a similar diagram to the one you saw before: rich scientific data repositories, software applications and tools, and integrated applications used by a variety of stakeholders, The Nature Conservancy, NOAA Marine Fisheries, and the US government's Council on Environmental Quality. And notice, in the gray boxes, it's semantic web in action.
Largely vocabularies, leveraging things like International Organization for Standardization standards that are accepted worldwide. And so we're starting to build and deploy these types of systems. What you can then get is a generalization of this, where our semantic web implementations are moving away from simple data integration to application integration frameworks. So we are now able to mediate the types of application-level mashups that you've heard Jim talk about, and ultimately these are used by a broad variety of people. So, we're not done yet, because there's a problem, and the problem is that this is what we would normally consider the full life cycle of data.
It's the thing that's sort of held up as: we go from data, we convert it into information, which people can look at, and then it becomes knowledge; we might write something about it, and then we really learn something about it. Now, I'm not going to go through this slide. When I teach this (I teach a course called Data Science), I spend 30 minutes on this one slide, going through it in excruciating detail. But the problem is, this is not actually the full life cycle of data; it's a micro life cycle of data, because any time we add provenance, any time we add supporting information, any time we add context, we're adding knowledge. And so you know what real data pipelines look like.
Ready? They look like this. And this is the one without the animation. Raw data is up at the top. Highly integrated data products are at the bottom. On the right are all the people and feedback loops that might occur. On the left are all the people and processes and metadata that might get added. This is the reality of the world, and all the data that you're seeing on the internet comes from anywhere along this line. But the problem is, and you can see the text on the left, there's fragmentation, there's disconnection, there's encapsulation, and all of them are bad for what I call the illusion of transparency that we're trying to get to in open data. It's an illusion, and you can come and ask me afterwards what word I use to replace it. But here's the ecosystem of what scientists want. Transparency (I forgot to put it in quotes here) is part of it, but it's also accountability and identity, and what it's made up of are these elements in the middle. Some of these are in the semantic web stack, and in fact these are all provenance elements; that's why it's so critical, and why we all are emphasizing it. You saw this definition, which is the one that we use, but provenance is only part of it, because in answering these scientific questions, determining fitness for purpose, all these things, the knowledge base that we use is multi-component: provenance is one component, the domain science of any particular discipline is another, and because there's data processing, you have to have that described in there as well. And so we construct these knowledge bases, and I'll give you just a little example of two of them. This is a provenance-aware faceted search that we've developed in the area of solar physics, where the facet boxes are along the top, and maybe you can read them.
People can select these facets, put them in any order and any combination, and the relationships between the facets determine whether you get results from the queries or not. And there's provenance in here: it says the cloud cover is clear, even for data products. Up comes the image, and here's the full provenance trace, with the ability to sample the metadata, find out that the cloud cover is in fact clear, and drill down through all of this. Scientists actually really like this. Now you're starting to get this explanation, justification, verification. We're not doing proofs or trust in this particular example. One more. So this is data, provenance, ontologies, RDF, RDFa, and SPARQL. Another one, from NASA, and this one's actually fairly cool. This is not going to be an atmospheric science lesson, but NASA has lots of satellites and lots of instruments, and what people like to do is run correlations between those different instruments and different satellites to do sanity checks. So just look at this one. These are two different instruments measuring the same thing on the same satellite. Take a look at that.
Does that look right? It's a correlation. Good correlation would be red or orange. Negative correlation is blue. Take a look at this one. These are the same instrument on two different satellites. Look at this. What's that? You probably can't see it, but you know what this is? That's the international date line. Does the atmosphere know about the international date line? Do satellites know about the international date line? No. So what's wrong? Well, you can explain it to them at the end, but what we're doing now is, on the fly, we're intercepting these selections, because of the provenance. We have a thing called a semantic advisor, so that when they make these selections of parameters they want to look at, for a particular time, and how they want to visualize it (here's the thing that they visualized), it brings back a chart that says: this is the first one you want to look at, this is the second one you want to look at, and in this column, are they different? Well, it says: the data sets are different, the platforms are different, the times they cross the equator are different, one is ascending and the other is descending, and your definition of a day is the same. But when combined with these other factors, you produce an anomaly, and we can actually explain it to them and correct it for them. Okay, that's pretty cool. What's the anomaly? Ah, the anomaly is the difference that's introduced by the same day-to-day definition combined with these two factors here, the fact that the equatorial crossing times are different. So they use the same definition of the day, but the actual equatorial crossing times are 5 hours apart.
And that produced this overpass time difference, which shows you exactly why the anomaly is occurring, and that's just a blow-up of it so you can see. And this is what it looks like inside. It's distributed: NASA Goddard generates the user request. I won't go through this in any detail, but it uses OWL, it uses domain ontologies, it uses general rules, it uses inference, it generates an advisory, and it sends it back to the other end. And I can show you the general rule.
It's about 18 lines long. So these are the types of things that we're building for real agencies, with real problems, using very modern ontology development techniques and tools.
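For flavor, here is a plain-Python caricature of that kind of advisory rule. The real one is about 18 lines of OWL-backed rules with an inference engine behind it, so this is only a hedged sketch, and every field name, product, and crossing time below is invented.

```python
# Hypothetical sketch of a semantic-advisor rule: explain why comparing
# two data products may produce an apparent anomaly. Names are invented.
def advise(a, b):
    """Return advisory notes for comparing data products a and b."""
    notes = []
    for key in ("dataset", "platform", "equator_crossing_utc", "node"):
        if a[key] != b[key]:
            notes.append(f"{key} differs: {a[key]} vs {b[key]}")
    if (a["day_definition"] == b["day_definition"]
            and a["equator_crossing_utc"] != b["equator_crossing_utc"]):
        notes.append("same day definition but different equatorial "
                     "crossing times: expect a date-line artifact")
    return notes

prod_a = {"dataset": "A-L3", "platform": "sat-1", "node": "ascending",
          "equator_crossing_utc": "13:30", "day_definition": "UTC day"}
prod_b = {"dataset": "B-L3", "platform": "sat-2", "node": "descending",
          "equator_crossing_utc": "08:30", "day_definition": "UTC day"}
for note in advise(prod_a, prod_b):
    print(note)
```

The point of the design, as in the talk, is that the advisory is generated from the products' provenance metadata at selection time, rather than the user discovering the date-line artifact in the plot.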
All right, almost done. Now some animation. So here's this diagram; I'm using it for a different purpose here. At RPI, since the three of us have been there, we're in fact developing curriculum at the graduate level and pushing some of it down to the undergraduate level. In the yellow ellipse is the area of data science, which I teach. In the middle, in orange, is what I call Xinformatics (you can ask me about that later), which I also teach. Joanne and Deborah teach Semantic eScience, which covers this space here, with these overlaps. And I didn't put it on here, but underlying it all is web science, which Jim teaches. So we're trying to train people so that on day one, when they graduate, they can come and do all this stuff for you or someone you know. I guess you can't teach wisdom. No, don't want to try that just yet; we need something else to do in five years when we solve all these problems.
That's a joke. So here's my summary. Semantic eScience approaches are changing many fields of science extremely rapidly. This informatics approach enables integration at a variety of levels: virtual observatories bring the data together, and virtual organizations bring the people together. And we're finding massive (massive can be a relative term here) collaboration through vocabulary mediation, because that's how people collaborate; that's how it works. But it exposes many issues: transparency, the dependence on provenance, and the new things that are popping up, like quality, fitness for purpose, and trust, and how you make those computational. So what we're doing, from my point of view, is exploiting this tension between production applications and research issues, and implementing real things. And my goal, personally, is to restore abductive reasoning to the conduct of science, whether you're a specialist or a non-specialist, and that is going to be done with the semantic web, in an open-world environment, using the internet as your primary computer. And I'll just finish with that slide that shows the constellation. Thanks.

Twenty minutes. Do you want to take some questions?

Audience: So, for solving really complex
problems, how do you choose what data to use as a source?
Good question. So, very quickly, I'll just pull up this diagram. This team of people here, this small team, includes the scientists; it includes people who are familiar with the data, as well as software engineers, knowledge representation people, object modelers, and so on. So they're in it from the start, and there are two modes. You identify the obvious data sets, which you saw at the bottom of some of these, and then, increasingly, we're going out and finding data sets, because data sets are now being broadcast using things like RSS and Atom. So we're going out and finding those as well, and if they have the right markup, we can discover them and see that they're relevant.

Peter, can you come to your last slide? Yeah. So Marco just asked me to take one minute to answer the question he asked before: what next, what do we do, etc. To start with, of course, we're a university group. For those of you who don't know, RPI is in New York State. We're not in New York City; we're about two and a half hours up the road, in the Albany, Troy area. Not in Rochester. A lot of people think we're Rochester or Rochester Institute of Technology. We're RPI, Rensselaer. We offer the traditional stuff that academic organizations do. So we have undergraduate and graduate programs. For anyone looking for a master's degree, we have an Information Technology and Web Science program which has a lot of this stuff in it. We're in the process of creating a PhD program specifically on this stuff; we just can't figure out what to call it yet. We'd love your ideas; anyone, send us anything. We also, as you probably have noticed, are not like a traditional academic group. You didn't
see slides full of mathematics and things. We're very interested in the whole transition of technologies, building technologies. We obviously have a lot of semantic web background and flavor. I actually am getting bored of that; I've been doing it for 20 years. So I want to do Web 5.0. I used to say 4.0, but now you're pushing me. But anyway, there's that. Again, we have a lot of opportunities. We don't do any real training courses; we've been helping people develop training courses. We're very interested in working with people who'd be interested in taking this stuff further. In a sense, what we'd really like is to stop evangelizing this technology and go back to what we were doing a decade ago, which was being in the lead. And I think Peter and Deb have shown you what some of those leading-edge things are. So in a sense we've played with this semantic web stuff for a long time, we're moving in some of these new directions, and we're very interested in finding partnerships across that whole spectrum. You have our contact information on the mashup stuff. We love talking to people. Marco wanted a second, and then I think he was going to throw it to questions at this point.

Audience member: Definitely would like to know: why don't you mash up everything that we spoke about? Have you considered, maybe in data.gov, now that we have all these curated data sets, having some computational way of cherry-picking through them? Say I live in New York City, I log into data.gov, I pull in the related data sets, and this might be something I know that somebody in Washington, DC doesn't know. I can then add some more.

Right. So I actually didn't get into that whole thing; that's a big part of what I'm actually moving into.
And the science stuff is a particularly good domain for this. Policy makers make decisions based on this data. Those policies affect our lives. Yet we don't really have any input back into the data. And one of the things that's really fascinating, when you talk to the CIO and CTO of the US and people like that, is that they want to turn it into a conversation. And the question becomes: what does it mean to be a conversation? So they've got a site you can get to from data.gov that will let you suggest data sets, make comments, tell them about your apps. Anyone who develops an app off this stuff can ask them to put it on their page. But the data itself is a lot of stuff. How do we start talking about it? How can you say what's going on here? So one of the things that's really interesting in these visualizations is we've been going back to government organizations and saying, hey guys, you've got a bug in your data, and they're like, how do you know? And I'm like, look. I didn't show all these examples, but we have wildfire data: how many acres were burned, how many fires in 1985? The answer is zero. And we're pretty sure there were wildfires in '85, and that that's a data error. So we showed that to them. The guy who now runs data.gov is the former CIO of the Department of the Interior, who's moved over to data.gov, and he says stuff like that started convincing him that this is good for the government; this is how they will know more about their data. One of the reasons we did this mashup was so that they could do some of these visualizations before they release the data, so they can find some of their own errors. I have a whole bunch of anecdotes I won't go into. When you start getting into the scientific data, though, you can't do that by eye, right?
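A validation pass like the wildfire example (a year reporting zero acres burned between nonzero neighbors) can be sketched in a few lines. The function name and the figures below are invented for illustration, not data.gov's real series.

```python
# Sketch of a sanity check over yearly totals, flagging suspicious zeros.
# The numbers here are made up; the real wildfire data set differs.
def flag_zero_years(acres_by_year):
    """Return years whose reported total is zero but whose neighbors are not."""
    flagged = []
    for y in sorted(acres_by_year):
        if acres_by_year[y] == 0:
            neighbors = [acres_by_year[n] for n in (y - 1, y + 1) if n in acres_by_year]
            if any(v > 0 for v in neighbors):
                flagged.append(y)
    return flagged

wildfire_acres = {1983: 5_000_000, 1984: 2_000_000, 1985: 0, 1986: 3_000_000}
print(flag_zero_years(wildfire_acres))  # [1985]
```

Running a check like this before release is exactly the "find your own errors" workflow described: a zero surrounded by nonzero years is almost certainly a reporting gap, not a fire-free year.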
You can't look at, you know, some of the stuff Peter's talking about and say, well, of these million observations, we think these 37,000 are somehow anomalous, right? So you really need to start looking at what the tools, techniques, and so on are. And that's also why you need this whole infrastructure: where did the stuff come from? So again, yeah, you're mashing up all our stuff, and in fact that's part of what we came together to do.

A quick follow-on to that. If you're going to get massive collaboration, you have to solve the last-mile problem, which is not computer to computer but computer to person, and localization. It's familiar things that they can actually do, that are actionable, that are fundamentally important, and the web is making that much easier. And how do you make a connection to them? We might know something about who you are. You said your location; that's one thing about who you are. But we might have additional context, and if we can tie into something that you understand, you're more likely to understand the rest of our message.

Audience member: Yeah, I'd like to go back to the question of intellectual property, this open data, and so on.
I'd like to ask: what kind of data is it? Is it truly open, public domain data, or what kind of pitfalls should we look for? For example, NIH has the UMLS, you know, a hundred vocabularies, but a number of them have their own licenses and so on. So what happens with this data? Is it truly public domain, can it be used in any country, and so forth?

So, data.gov: any data set they release has to be under what's called the data.gov release policy, which basically says it can't reveal any personal information, can't reveal anything against the US national interest, and other than that anybody can use it for anything. So it's very open. Now, in the British case, they're actually releasing some data sets that do have some of those restrictions, so they've spent a lot more time looking at issues like the Creative Commons licenses. It's easy when the data lives in a document to put a tag on that document; when the data comes out of the document, how do you put a tag on the data? How do you say, if I mash up this data with that data, how do I know where it came from? That's where a lot of the provenance stuff comes in. So there are still a lot of hard problems down the road. The US has found one solution, which is simply to release the stuff that can be released; the UK has been using some different approaches, as have some of the other governments. So pretty much what the governments are doing is giving away stuff they're pretty sure is okay no matter what you do with it. And that's limited some governments to giving away much less stuff than they otherwise would. So we think that as technology solutions help put policy restrictions on things, explain things, bring things together, and let you know where things come from, that may help. But right now, at least for data.gov, everything there, all 273,000 data sets, is available for you to do whatever you want with, in whatever way you want. And in fact, they encourage you to. Gale said towards the end something about FOIA requests, the Freedom of Information Act.
There was a White House blog post about two months ago where they said that since they created data.gov, the amount of money spent on FOIA requests by the administration has gone down by hundreds of thousands of dollars, because people used to have to ask for this data. They're like, if we have to give it away when they ask for it, why don't we give it away first? Now what's the problem? Search, right? How do you find the data you're looking for? A lot of the FOIA requesters are saying, "Can we have this information?" They say, "It's in data.gov." And they say, "Where?" And the government's like, "Oh." So that's not true in all areas. It's true in data.gov, but in some areas you do need to put some policy information in there, and then we need to check for compliance. So: if I'm allowed to give it to you, if I wasn't precluded from giving it to you, but you're then going to be required to give it to somebody else if they ask for it, then I might need to limit my ability to give it to you. So we need to encode the policies and then check the compliance; those are separate research efforts. We've got work on encoding those policies and checking to see whether they're complied with.

Audience member: A question about linked data. The same question for two sides here. Can you give me some examples of linking data sets on data.gov, and then also on the e-science side? Linking data within the data set, not a mashup outside the data set, but within data sets themselves.

Right. So we're doing several things in that space. First of all, when you do the mashup against stuff, there's common terminology. For example, one of the Google visualizers is a timeline. So I've got several different data sets with time information in them; I can pull those together even though they weren't intended to go together. So the simplest kind of linking is common naming over stuff we know.
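The "simplest kind of linking" described here, pulling independently published data sets together on a shared time field, can be sketched as a plain join. All field names and numbers below are hypothetical.

```python
# Sketch of a timeline-style mashup: two data sets, published separately,
# joined on the years they have in common. Values are invented.
def join_on_year(a, b):
    """Merge two {year: value} series into {year: (a_value, b_value)}."""
    return {y: (a[y], b[y]) for y in sorted(a.keys() & b.keys())}

wildfires = {2006: 96_385, 2007: 85_705}   # hypothetical acres burned
budget = {2007: 1.9, 2008: 2.1}            # hypothetical spending, $B
print(join_on_year(wildfires, budget))     # only 2007 appears in both
```

This is all a timeline visualizer needs: a shared, unambiguous key (here a year) across data sets that were never designed to be combined.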
So one of the things the governments are very interested in, and there's actually an international effort starting up on this, is what some specific kinds of terminologies for those things should be. Second, we do a lot of linking to DBpedia, to census data, things like that. All our data, once it's in RDF, now has URIs. Those URIs can be linked to other URIs either by complex mechanisms like owl:sameAs or by procedural code. In my world, the sort of broad world, we're trying to solve it without deep reasoning; we're trying to say somebody sticks that information in there and everybody takes advantage. So that's the linked open data cloud, and our six billion triples are tagged into it, but there aren't enough in- and out-links. And it's true, because most of our links are at the data-set level, not the data-element level. But of course, if the data element is "14" and you don't know what it stands for, you can't do much. So we're very involved in thinking that stuff through, but again, my world is a very broad, very shallow world. Peter?

Well, I would say on the scientific side there's not an enormous amount of what's termed linked data yet; the agencies are starting to work on it. I would consider the work that we've done very much linked data, and we do data integration within data sets all the time. You can ask questions like, what's the state of the atmosphere, and it knows to go and get temperature, pressure, density, and composition without you even having to ask. So there are lots of smarts in these, and these are ontologies using domain expertise. So these examples, I think, are more complex understanding and then integration. I think that's one end. And then on the NIH end, we're kind of in the middle, where we do a little bit less understanding and a little bit more integration.
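A minimal sketch of the owl:sameAs-style linking mentioned above, without an RDF library: URIs connected by sameAs assertions collapse into equivalence classes (here via union-find), so anything stated about one alias applies to the whole group. The URIs below are hypothetical.

```python
# Toy model of owl:sameAs reasoning: group URIs into equivalence classes.
def same_as_groups(links):
    """Given (uri_a, uri_b) sameAs pairs, return the equivalence classes."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in links:          # union each asserted pair
        parent[find(a)] = find(b)
    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

links = [
    ("http://data.example/entity/NY", "http://dbpedia.example/New_York"),
    ("http://dbpedia.example/New_York", "http://geo.example/5128638"),
]
print(same_as_groups(links))  # one group of three equivalent URIs
```

A real system would state these links as RDF triples and let a reasoner or procedural code do the merging; the point is the same: once aliases are declared, facts attached to any of them can be pooled.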
So we've got data binned by region, binned by state, binned by county, with time frames that don't quite map perfectly, and then we have to basically do a translation step so that they can actually be integrated better. So the fast answer is that there's a continuum, and what was fun about bringing our lab together is that we want to figure out how we stop these being different communities that don't talk to each other, and really start to understand across the whole thing. But that's our science.
Audience member: Yeah, I was wondering: one of the most contentious issues is the so-called social graph, essentially a lot of personal ontology. To me this is a big problem: companies are monopolizing people's identities and their networks, and that will place severe limitations on your ability to integrate all your personal data. Is there any open way to solve this issue and establish a unique identifier for me that I can link
to all the other data?

Right. Well, trick number one is to establish a unique identifier for you when you want one, but not one that's always unique, right? If you go to this system and that system and you don't want it known that you're the same guy, you also need some control. So what I'll tell you is there's a lot of thought going into that. I'm working now with the guys who are doing what's called WebID. The guy who has pushed the idea that's becoming best known was the chief architect at Twitter, where they had exactly this question all over the place. Right? I've got this Twitter persona; Twitter built itself as an open thing that wanted to link stuff, but it can't, because it can't touch into this system and into that one. So with WebID, the idea is to have something sort of similar to an email address, but reverse-indexable to something on the web that, when you look at it, would say: here's what I'm called in Facebook, here's what I'm called in Twitter, here's what I'm called in email, that kind of thing. And then the next step beyond that, something Berners-Lee and I have been playing with a little bit, is groups. Almost every application you use has some kind of group model. In email, it's a mailing list. In Facebook, it's a Facebook group.
In Twitter, it may be friend groups, or there are a lot of other informal mechanisms, on and on. And so the problem is: I'm here now, and someone says something or shows a URI, and I say, gee, I'd like to share this with everyone in my lab, right?
Well, right now it would be application dependent. Who in my lab is in that application? How do I know all their addresses? We want to break that stuff open. So what you'd like is... well, the original idea was OpenSocial, and that's sort of moving. But the newer idea is OpenID-like stuff, but with real meaning behind it, that lets you control that linking and also lets you deny that linking. Right? If you don't want me to know who you are on Facebook, you should have an easy way to just not have it show up in that stuff. And that's a little bit problematic, because Google wants everything in there, and Yahoo wants everything in there that Google doesn't have. So again, there are a lot of IP and business issues like that. But I'd say, if you've been following this Wired debate, "the web is dead, long live the internet" versus "no, the web is wonderful" (which I actually think is a stupid debate, but I'll take that offline later): a big part of that is, are we going to move to an app kind of world where everything is separated, so we silo everything, or are we going to find a way to put it together? And where's the place to put it together?
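The WebID-style idea sketched in this exchange (one identifier mapping to per-service handles, with the owner able to deny a link) might look roughly like this. The class and all names are invented for illustration; this is not the actual WebID specification.

```python
# Hedged sketch of a cross-service identity profile: one web identifier,
# per-service handles, and owner-controlled denial of individual links.
class Profile:
    def __init__(self, webid):
        self.webid = webid
        self._handles = {}     # service -> handle
        self._private = set()  # services the owner refuses to reveal

    def link(self, service, handle):
        self._handles[service] = handle

    def deny(self, service):
        self._private.add(service)

    def handle_for(self, service):
        """Return the public handle, or None if the owner denied the link."""
        if service in self._private:
            return None
        return self._handles.get(service)

p = Profile("https://example.org/alice#me")
p.link("twitter", "@alice")
p.link("facebook", "alice.smith")
p.deny("facebook")  # linking is under the owner's control, per the discussion
print(p.handle_for("twitter"), p.handle_for("facebook"))  # @alice None
```

The design choice mirrors the point made above: the profile is reverse-indexable (one URI resolves to all public handles), but denial is first-class, so refusing to reveal an identity is as easy as asserting one.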
It's in that semantic backbone of the web. And that's really where a lot of the cool stuff is happening, really deep in the infrastructure. The guys who are doing this are deep webbies from the old days who are really looking at what the right kinds of naming conventions are, how you do this. So it's pretty cool stuff, but I think we're still a few years from seeing it really break open.

Audience member: Yeah, kind of following up on what you just said: I see RESTful-style solutions as an important way to make this stuff happen, and I didn't see it in your presentation. Is it just starting to make sense in this context, or...?

I'm going to give you a fast answer, and I'd be happy to talk for hours afterward if you guys have other opinions. But what I'd say real quick is that the web has become a development platform.
In fact, one of the things that slowed the semantic web from hitting as big as we thought it would, as fast as we thought it would (and now, after 10 years, it's finally turning some of the corners we predicted after five), was the realization that we had to get back into that infrastructure, right? When you change the labeling of a web link, something big is happening. You change the meaning from a tree to a graph: big stuff. So a lot of stuff had to percolate up, and there's actually some pretty cool stuff down there now. What I'd say is that a lot of this stuff really does fit with REST. A lot of the new SPARQL protocols, a lot of the SPARQL 2 stuff (SPARQL 1.1, I guess they call it now) is looking in that direction. So we're firm believers in it. But there's also a group that says, well, now you're seeing kind of a third level. You've got the internet level, the web level, and now you're starting to get this new API level, which says: abstract up the information levels, right? The web is all about making it so you don't need to understand how the internet works when you're developing an app. So why should you have to understand how the web works when what you're trying to do is say, get the data that Deb's using and the data that Peter's using and show it to me on the timeline that Google gives me? Why do I need to know anything about RDF? That should all be somehow back there in the infrastructure. There's a lot of discussion of the questions you're asking; I don't think there's a simple answer. But if I'm a provider, we sure hope you do it as either REST or a simple service API. So the example I'd give you is that visualization has for years been the long pole in the tent of doing anything with data, right?
I've got this data, I process my data, I run it through some data analysis tool, and now I can publish a journal paper that says I ran my data through the analysis and the number is 63. And then if you wanted to do any kind of demo of that, you had to hire a whole new team of programmers. We're doing it all through APIs now, and XML transforms, and things like that. So again, this web level of abstraction is doing some very cool stuff in that space, and that's pulling along some of the REST and services side. I won't go on all night... I was about to say I could go on all night, and Peter would say, you are.

Audience member: What about corporations who might have had very comfortable arrangements in the past with government data? Do they fight, or is that an issue at all?

Do you want to take that? I feel like someone else should take that question. Hasn't that been in the news recently, about an oil spill? It's there. The short answer is that when these types of revolutions in how we do things come, people don't go quietly, and it takes a while for these changes to propagate, and we're seeing resistance. But that's why there are a lot of initiatives about openness and claims of transparency: the more you make people aware of it, the more people start to get interested in it and start to hold those entities, whether they be companies or even governments, accountable. And accountability was on that slide of mine. That really is starting to change at the international level, and this is true for science and medicine. There's a worldwide effort to push these open data policies everywhere. Absolutely everywhere.
And as for resistance: it turns out, by luck, that one of the best things about the semantic web was that we weren't as successful as we thought we'd be, as fast as we thought we'd be. I can give you a long discussion, but in a sense what happened is that a lot of people were attacking other things, where they thought there was more money, while we were getting our act together. And now, all of a sudden, I'm reading papers that say the semantic web will never happen. And I'm like: you ever hit a Like button? That's RDFa. That's stuff that came right out of those early papers that said exactly how to do it. So this stuff is really coming out now. And I think because it's coming in at the infrastructure level, it's hard to keep it out. And that was a lot of how the web worked. So we cross our fingers that what I just said will keep going.
Well, I think, like... actually, something that was on the cover of the Times recently: healthcare again, with Alzheimer's, and the researchers and collaboration around that effort are all positive things.

Audience member: I'd like to ask a slightly more technical question about how you guys are moving forward with interoperability, typically in science, and with graph databases and the kind of graph programming done at, say, Sandia.

So we're starting to do a lot with graph-based approaches, and even a lot of the algorithms that are coming out now are graph-based algorithms, because in science especially (I didn't go into this in great detail) a lot of the algorithms, especially at semantic web scale, are n-squared or n-cubed.
Okay, that's terrible in computational terms. You want to be n log n, and we're even talking now about sublinear; that's the only way we'll be able to keep up. Currently all the techniques are based on graphs: partitioning, trimming the graph, all these types of things. And the intention, really, as was just said about APIs, is that the way you deal with it now is at such a low level of programming that it has to be raised up a very significant level. Otherwise, no one's going to be able to create apps to actually use these, to find results from these data or use these applications. And so we've got a whole API evolution that needs to come along.
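The scaling point here can be made concrete: detecting duplicate data elements by comparing every pair is O(n²), while a single hashed pass is roughly linear, and at billions of triples only the latter family of techniques keeps up. A toy comparison, with invented data:

```python
# Two ways to find duplicate data elements. Both give the same answer;
# only the second survives web-scale input sizes.
def duplicates_quadratic(items):
    dups = set()
    for i in range(len(items)):           # O(n^2): compare every pair
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                dups.add(items[i])
    return dups

def duplicates_linear(items):
    seen, dups = set(), set()
    for x in items:                       # ~O(n): one hashed pass
        (dups if x in seen else seen).add(x)
    return dups

data = ["NY", "CA", "NY", "TX", "CA"]     # hypothetical data elements
assert duplicates_quadratic(data) == duplicates_linear(data) == {"NY", "CA"}
```

The graph-partitioning and trimming techniques mentioned above serve the same goal at a larger scale: restructure the problem so the expensive pairwise work never happens.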
I'll leave it to you how many more we take.

Moderator: Yeah, we have another five, ten minutes.

Audience member: Okay. I would go with a question that's a little bit the opposite. You talk about open data, but a lot of data is not open, and actually there are some good reasons for that data not to be open, to be hidden in websites, because you don't necessarily want to reveal it. Now, is there work being done behind that, that goes into security, perhaps a security layer?

Oh gosh. Yes. Next question. So, first of all, there's a lot of enterprise stuff going on that obviously has to think about that. Second of all, there's been a lot of work from day one. Remember, a lot of this stuff was funded out of DARPA right at the beginning and was about government interoperability. So it's not coincidental that data.gov uses semantic web stuff in both the US and UK, the biggest projects: it was designed to do that stuff, and it's just that we thought it would happen within the government first, and instead we had to get them to give it out. But a lot of that security... what was really nifty is, I was at something in a community that doesn't like to talk (I had to sign papers; I'm not allowed to tell anybody anything), and they were asking exactly this question. I said, well, you already have a web security model, right? And they said yes. I said, okay, you're done, right? To one level of approximation, this is just web stuff. Now, when you start talking about data aggregation, data propagation...
there's a lot more there. But for simply locking the door and doing the access control: the whole web is hard to do that for, but the semantic web doesn't make it that much harder. It does add some new and interesting issues, and there are a lot of us working on it. I mentioned we were doing this work on policy; we've been doing that for six or seven years now with the MIT group that Tim Berners-Lee runs, and it's all about policy awareness: how do you make policies explicit, and how do you make things accountable? The current security framework is: if I can't protect it completely, don't put it anywhere. Of course, the new zeitgeist of the digital native is: we want to share everything. So the question becomes, how do you build a
security and privacy and control layer that's usable by people who want to share, not just by people who want to hide things? And so there's a lot of interesting discussion. There have been a lot of workshops, and I won't say there's anything I can point at and say, "Yeah, there's a solution right there." But I'll say that it's a very active and exciting topic.
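The "encode the policy, then check compliance" approach mentioned earlier might be sketched as machine-readable restrictions checked against an intended use before release. The policy fields below are invented for illustration; real policy-awareness work encodes far richer rules.

```python
# Minimal sketch: a data set carries machine-readable restrictions, and a
# proposed use is checked against them before the data is handed over.
def compliant(dataset_policy, intended_use):
    """Return (ok, reasons) for an intended use against a data set's policy."""
    reasons = []
    if dataset_policy.get("personal_info") and not intended_use.get("anonymized"):
        reasons.append("contains personal information; use must be anonymized")
    if dataset_policy.get("license") not in intended_use.get("accepted_licenses", []):
        reasons.append("license not accepted by the consumer")
    return (not reasons, reasons)

policy = {"personal_info": False, "license": "public-domain"}
use = {"anonymized": False, "accepted_licenses": ["public-domain", "cc-by"]}
print(compliant(policy, use))  # (True, [])
```

The point is the shift described in the discussion: instead of "if I can't protect it completely, don't publish it," the policy travels with the data as something a machine can check, explain, and enforce.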
And then there's also work that looks at: if you think you want to share everything, should I inform you about the potential consequences of sharing everything, and should I provide some infrastructure that allows you to protect things that maybe you should have protected but didn't think about? So there's also this ethical, moral slant to all of this: not only are we
providing and designing technology, but we're also trying to think about the ecosystem it should sit within, and being responsible creators, knowing that some of the people who adopt it are just not going to have thought about all the issues that they maybe should.

Well, I'm thinking of situations like, say, healthcare, right? Right now people get up in arms about Facebook giving away too much of their information. But guess what? Your healthcare information is going everywhere, and what sort of control are you ever going to have over that? And if you do, would there even be infrastructure for it? I might give control or permission to this company, but now this company has all these relationships with all the other companies or other sources that might be using that data, and how does that marking, that data, traverse the entire semantic web?

Right, right. So policy is part of it: a transparent policy is a policy that can be encoded and enforced. But another part of it is, if I give you all the knobs and bells and whistles to protect it, can you do it, and can you even anticipate what you should be protecting? Healthcare information is kind of the first example. HIPAA is probably doing you a favor, but sometimes it's not. Sometimes it's keeping information from your doctor or your healthcare provider in a timely manner, when it could have saved your life. So then it's really hurting you. And do you want to pass along lab results to all of your doctors? Often you say yes, but there are some tests that you would only do if you had particular diseases. So the fact that you had that test makes me, if I'm a doctor, realize that you're probably HIV positive, or something, just to pick one example. So you might not realize the implications of letting something out, or of keeping something in.

Audience member: I have a question for you.
So how do we make money?

The good news is we're academics.
We're the guys who figured out that we don't have to make money out of it. We took a vow of poverty. Well, I didn't take as much of a vow; I like to be on boards that help companies.
Yeah. So I was on the board of a startup that went bankrupt last week; the VC pulled out. I'm happy to say the reason we lost was that another company figured out how to do the semantic stuff better, faster, and cheaper than we did. So I think the real answer is that it's an infrastructure for innovation. With this data stuff in particular, the government is saying to people: please find ways to make money off of this, right? The Democrats are saying, "Please find ways to make money, because then the Republicans won't be able to turn it off if they become the winners." The Republicans are saying, "Please make money, because then we're able to show that we back the commercial part of this stuff, not that crazy science crap." Sorry, my politics may show. But more seriously, it's really interesting: across the aisle you're seeing different reasons they want to share this stuff, but it all comes down to creating value via innovation, or also saving money. I thought Gale had a nice example when she mentioned FOIA; that's a cost-savings perspective. It's a new way of looking at something they were mandated to do anyway. Finding what you're looking for when you don't know what you're looking for is the big-value stuff. Again, we didn't talk a lot about the commercial edge of this stuff, the RDFa work that's going on now, things like that; I just assume you'll have other people in this meetup who talk more in that space. But 10 years ago, we said there's value in doing this semantic stuff and people kind of giggled, right? And now, 10 years later, we're not having a lot of trouble convincing people that there's value somewhere in here. And we're being asked exactly the question: where is that? Where's the specific value?
And, you know, we didn't know how to answer that for the web for the first 15 years, and it turned out to be advertising instead of pornography. It surprised a lot of us; some of us thought it would be something better. Before we take the next question: someone mentioned the semantic technology conference before. If you haven't been there, you should check it out. It's about the business of commercially leveraging the semantic web, and most of us started going there five or six years ago, when there were small numbers of people, and people struggling to sell the semantic web. And suddenly Oracle shows up, and all these startup companies are coming along, and there's value in apps. Again, the ad economy of the internet is driven by knowledge, and what we're talking about is making more of that explicit and machine-manipulable. So the where-the-money-is is easy; how to extract and commercialize it is the hard part. But search engine optimizers are looking gleefully at RDFa. They're like, how do we get this everywhere? Things like that, where a year ago they were saying that stuff will never catch on. We're seeing really amazing changes in people's thinking, but I don't think it's yet hit the tipping point. Did you have a question?

Well, I think the Alzheimer's case is the most recent one, where they said they couldn't have made the findings without sharing that data. The sky surveys are some nice examples: when they shared data, you got high school teachers collaborating with astronomers, because one person saw something in the data and it turned out to be something really of interest. I think we can point to at least dozens, hundreds, of examples now where sharing the data and working in some kind of collaborative manner is starting to change the way discoveries are made.
Peter and I started this, what, a decade ago, and I think we're starting to see our users changing the way they look at science. I remember Peter introduced me to one of his collaborators who was very hostile to this new way of doing science. He was like, I don't need that; I know how to do this. And then, when he decided through his own intuition that this would change the way he looked at forming an experiment, he became our biggest supporter. Speaking of which... Right.
So, inside RPI, we have become somewhat popular as well, and the single largest outreach to us is coming from the humanities and social sciences, from cultural anthropologists, because of digital humanities. If you think science has taken off, digital humanities is going through the roof: computational economics, computational sociology. We're talking to people who are running things like asthma portals, who are finding that environmental factors are now back as a big thing, and guess what, you need environmental scientists, and they don't speak the same language. So this is all over the place.
I was on a phone call with the sort of chief visionary technologist guy at a medical company today, and he said he had just come back from a meeting between some of the biggest companies in the world of the web and some of the biggest medical guys, talking about the future. And he said, you know, the one thing everyone agreed on was that 20 years from now, the hospital won't exist the way we think of it now. It's going through what libraries are going through now.
The library used to be the primary place you went to get information; now it's not, and they have to reinvent, you know, what their role is. And they believe the hospital will stop being the primary place you go to get health care and health information and to do monitoring and testing. He said there were 20 different visions of what would replace it and how it would happen. But, you know, it's a big area, an exciting area right now, and the amazing thing to me is that this technology is one of the things people agree is probably a big part of it. It's not sufficient, but it's necessary. And you see it at all levels, both with the PhD researchers but also with patients, like PatientsLikeMe. Have you heard of that website? Yeah. So people who have some medical problem are sharing information about that medical problem, and then they're forming communities. You know, you can debate whether it's good or bad, but it's a new phenomenon that was enabled by this kind of technology.

Just a quick follow-up point on that. One interesting aspect is that people are looking at things like PatientsLikeMe, and at places like Sage Bio as well, where patients are self-tagging their data with semantic information that's shared among the community, and then that information is being funneled back to the scientists as phenotype information. So then they correlate what people are actually experiencing in their lives, in terms of symptoms or whatnot, back to the genetic information.
So it's an interesting case of data flowing in multiple directions, where the genotype data is semantic and the phenotype data is essentially mashed up by organizations like PatientsLikeMe; it's very interesting as well. And just to conclude, on the question asked before: the vested interests who control this stuff are, you know, hostile like you wouldn't believe. The ownership of information is the biggest thing in medicine today; the sharing of information is the biggest thing in healthcare tomorrow. Talk about, you know, a conflict that our society will have to face sooner or later in a very big way. And, I mean, yeah, it's fun to be one of the technologists in this space. Boy, I keep my head down a lot when we get into those conversations. Absolutely. I think that's probably a good time.
Marco, that's probably a good time to stop because you wanted a few minutes. Thank you so much for coming.