History of open data and Yandex Hackathon

On September 14-15, the first Yandex Hackathon will be held in Moscow. For two days and two nights, participants will build projects based on open government data using Yandex technologies.

For many years I have been working to get Russian developers interested in open data. That is why the Apps4Russia contest, organized by the Information Culture non-profit partnership, was created. This year it has a nomination for those who build applications on open data and Yandex technologies. These events prompted me to give a systematic account here of the history of open data, its sources, examples of its use and many other important things.

image

This graph is from the LiveJournal of eugenyboger. Being able to look up detailed election results for each polling station is now the norm, yet until recently this was not the case even in highly developed countries.

Open Data: Background


There are several definitions of open data. One is given on Wikipedia: I translated it from English myself for the Russian-language article. Another, taken from the law, is given on the Government's website. There are a few more, but the gist is the same. Open data is information published by the organizations that own it (government bodies, in the case of open government data), provided on free terms (that is, under non-restrictive free licenses) and in a machine-readable form suitable for repeated automated processing. There are criteria that determine whether data counts as open. A Creative Commons license is almost a prerequisite for open data.

In principle, open data is not a new phenomenon: it has long existed in various forms, and the ideology of openness has been around for many years. Open source and free licenses appeared not five or ten years ago but much earlier. This is especially true of the scientific community, where the ability to check and verify research results, publish them and work with them in every way is important. Research data, as a rule, comes in specialized formats that are exactly what we now call machine-readable.

Today, developed nations around the world are striving for openness in a variety of ways. At the G8 meeting in the UK this June, the host proposed signing an Open Data Charter. Russia signed it as well. The main principles spelled out in the Charter are that data is open by default and published promptly in machine-readable form, transparency, and an obligation to create the conditions for developers to build applications on open data.

“By accumulating huge amounts of data, authorities and businesses do not always share it so that it can be easily found, used and understood. These are missed opportunities. We have come to a turning point that portends a new era. People will be able to use open data to generate ideas and create services that will make our world a better place,” the Charter says.

Now all G8 countries must declare their willingness to disclose information on crime rates, company registers and land transactions. The leaders here, of course, are Britain and the USA, which have been doing this for many years. But many other countries, including Russia, have now begun to publish open data.

This was driven by the growth of all kinds of technology companies and, with it, the growing value of data, the rise of the knowledge economy, and the emergence of large companies such as Yandex whose entire business is built on freedom of information. If every site on the Internet were paid and data could not be aggregated, the question of open data might never have arisen. Working with publicly available information has had a great influence here.

As a result, several trends came together and produced this phenomenon: freedom of access to information and open data. The idea is that information created primarily by the state, but in principle by anyone, should be available, and available in a form that can be reused. If someone has conducted a study and presented the results in a table, we should get not a picture of the table but the table itself, so that we can check it, use it, and maybe even build a business on those results. If the state discloses information about its activities, it is useful for citizens not only to know about it but also to do something with it. Maybe it will have a social effect, maybe an economic one, maybe the effect of “civic oversight”, a “civic fight against corruption”, and so on. But that, too, is an economic effect, just in a slightly different form. Open data portals, to which huge amounts of state-generated information are uploaded, are mainly created by governments. Something interesting and useful can be made from that information, and this is how the ideology of openness turns into concrete products.

But all of this started not with officials and the state but with people who took it up much earlier. In Britain, before the open data portals appeared, there were many small groups of developers doing projects along the lines of “let's rewire the state”, such as Rewired State. Or take ScraperWiki, which has existed for a long time: a special engine with which anyone who knows a little Python can write programs and scripts that extract data from sites.

Gradually this became so widespread that it no longer mattered whether states opened their data or not: people learned to extract it one way or another. In the United States, before data.gov appeared, there were Sunlight Labs and the Knight Foundation, which extracted data from Congressional reports, converted PDF files to Excel files, loaded the Excel files into a database, and exported them from there to CSV. Strong public pressure brought officials and authorities in the Anglo-Saxon countries to a point where either they do it themselves or it gets done for them. And if David Cameron had not seized on the topic of open data, put it in his party's program and come to power with it, then the Green Party, which now has data openness in its program, would have done so. And this is openness not of information but specifically of data.

image
Infographic: The Guardian Datablog

The right move for the state in such a situation is to try to lead the trend rather than resist it. And that is what it does, trying to steer the trend in the directions it considers priorities. This is not so bad, but it has its own specifics.

In Russia the situation is much the same. I have been working on open data since 2009; before that, our government had taken no action in this direction. For two years we pushed the topic actively, and when it became completely clear that we had advanced far enough not to need the state, its representatives suddenly realized that it was better to lead this trend.

Moscow has some claim to leadership here: for example, it launched its budget portal before the federal authorities did. In my opinion, the data published there is not ideally convenient, but you can work with it.

Usually the first to use open data are civic activists. In the United States, for example, they compare congressmen with one another and compile various ratings. Using transcripts of speeches, they work out how many words a congressman spoke during a quarter.

Open Data Status


Data usually exists in one of three rough states.

The first. The data is available and ready to work with: the state or another owner makes sure it is machine-readable. Here the entry threshold is minimal: we can take the data and put it on a map or use it in a mobile app. Everything is ready right away.

The second. The situation is worse: the information exists in principle, but it has to be extracted from various sites. For example, information on State Duma deputies is on the State Duma website, but only as web pages, so it has to be extracted.

Information on water quality in Moscow by district is on the Mosvodokanal website, but only through a special service where you must first enter the street, then the house number, and only then are you given the district and the pollution levels for various indicators.

To collect all this information, activists write scrapers: programs that pull information from websites and turn it into databases.
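As an illustration, here is a minimal sketch of such a scraper in Python, using only the standard library. The HTML fragment, its table layout and the `deputies` table are all made up for the example; a real site will have a different structure, and the page would first have to be downloaded.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page fragment (an assumption for this sketch).
# A real scraper would download the page first, e.g. with urllib.request.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Faction</th></tr>
  <tr><td>Deputy A</td><td>Faction 1</td></tr>
  <tr><td>Deputy B</td><td>Faction 2</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_to_db(html, conn):
    """Parses the table rows out of the HTML and stores them in SQLite."""
    parser = TableParser()
    parser.feed(html)
    conn.execute("CREATE TABLE IF NOT EXISTS deputies (name TEXT, faction TEXT)")
    conn.executemany("INSERT INTO deputies VALUES (?, ?)", parser.rows)
    conn.commit()
    return len(parser.rows)


conn = sqlite3.connect(":memory:")
count = scrape_to_db(SAMPLE_HTML, conn)
rows = conn.execute("SELECT name, faction FROM deputies ORDER BY name").fetchall()
print(count, rows)
```

Once the rows are in a database, they can be queried, joined with other sources or exported, which is the whole point of turning web pages into data.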

The third. The information exists in some form but is not publicly available. In general, everything we do is an attempt to achieve openness of information. I am talking not only about myself but also about the many other activists in Russia (including commercial companies) who are actively involved in this and are trying to achieve the following:

  1. That data already published in machine-readable form is suitable and convenient to work with and contains as few errors as possible.
  2. That data that is not machine-readable now gets converted. If it is published anyway, let it be published in a usable form; that is the most important thing.
  3. That what is not published now becomes publicly available.


To this end, an Open Data Council has been set up within our Open Government. The state has said it is ready to take part, and some changes to laws and regulations are being adopted. In principle, nothing prevents anyone from starting to work on data openness and to use open data.

Open data sources


Open data is not only government data. To a large extent it is the data of huge crowdsourced Internet projects. Not everyone knows, for example, that all of Wikipedia is available as dumps. Or take Wikidata, which is simply an amazing project in terms of its ideology. DBpedia approaches the problem from the other side: Wikidata is for people to gradually turn information into data, while DBpedia tunes algorithms so that existing infoboxes can be turned into linked data. Freebase, now owned by Google, was built entirely on DBpedia and Wikipedia. Its creators simply downloaded the data, made an interface that lets you add more, and on that basis built a rather expensive product.

Then there is the OpenStreetMap project. Likewise, its huge data dumps are publicly available and can be used. There are several dozen open crowdsourced projects from which you can collect data. These are mainly various encyclopedias, reference books and user-built databases.

For example, in France there are activists who track food products and add their ingredients and EAN and EPC codes to a separate database, which they distribute. The result is a directory that people with dietary restrictions can use to work out which foods they can eat.

That is, one part of the data is what activists create in various forms, and another is what the state provides; the state is the largest data owner. The third part is data published by commercial and non-profit organizations.

Commercial companies usually publish data for one of two reasons: either under duress, or out of social responsibility or some other motivation. Some, for example, attract developers this way. Nike publishes machine-readable information about its factories.

How to use open data in the world


Developers very often ask: “What can be built on open data? What are some examples?” I always suggest looking at what others have done. Just look at the contest sites NYC BigApps, Apps4Development, Apps4Berlin, Apps4Finland and Apps4SanFrancisco, although not all of their examples can be transferred to Russia.

The guys who created the “Do not eat here” project did not even use open data: they parsed data from the website of New York's food inspection service. They found where it lists addresses, company names and inspection results, put them on a map and made an application that works along the same lines as Foursquare. Based on the number of violation orders issued and still unresolved, it shows which places are not worth visiting. The application was even sold for a small fee, and people installed it.

A huge number of applications are listed in the City-Go-Round project. This is a small US portal that aggregates information on transit agencies and on the applications built on their data; 2000 agencies are collected in a separate list. 270 of them regularly provide transit data in a special format, the General Transit Feed Specification (GTFS). Thanks to this, hundreds of applications have been created on this data.
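To give a feel for why a common format matters, here is a small Python sketch that reads stops in the layout of a GTFS `stops.txt` file, which is a plain CSV with columns such as `stop_id`, `stop_name`, `stop_lat` and `stop_lon`. The stop names and coordinates are invented for the example, and the bounding-box helper is only one possible way to use the data.

```python
import csv
import io

# Made-up sample in the layout of a GTFS stops.txt file (an assumption for this sketch).
SAMPLE_STOPS = """stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,55.7570,37.6150
S2,River Park,55.7300,37.6000
S3,North Market,55.8000,37.6400
"""


def load_stops(text):
    """Reads stops from CSV text and converts coordinates to floats."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "id": row["stop_id"],
            "name": row["stop_name"],
            "lat": float(row["stop_lat"]),
            "lon": float(row["stop_lon"]),
        }
        for row in reader
    ]


def stops_in_box(stops, south, west, north, east):
    """Returns the names of stops that fall inside a latitude/longitude box."""
    return [
        s["name"]
        for s in stops
        if south <= s["lat"] <= north and west <= s["lon"] <= east
    ]


stops = load_stops(SAMPLE_STOPS)
print(stops_in_box(stops, 55.72, 37.59, 55.76, 37.62))
```

Because every agency that publishes GTFS uses the same column names, code like this can be pointed at any of their feeds without changes, which is what makes hundreds of applications possible.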

There are also new media projects such as Storify. A huge amount of open data is already loaded there, which you can use in your own mini-newspaper to create graphics or other complex visualizations and so supplement your stories. Storify creates an environment in which people themselves come up with ways to use open data. In the same category are the many projects that create infographics online and let you draw charts, load ready-made data and manipulate data that is already open: Socrata, Factual, and the same Freebase that Google acquired along with Metaweb.

It’s not always possible to make money on your application, because the data you’ve used is not always enough to create a complete product. But you can monetize the result in other ways.

Data is like an ingredient. If you have no salt, the dish will not taste good, but you can still eat it. If you do have salt, you can sell the dish for more, or the people you feed will be happier. Sometimes the data is the dish itself, and sometimes it is the salt. Either way, data is rarely valuable on its own, and many projects that work with open data actually use it only as a supplement.

For example, real estate services in the USA and Great Britain are being transformed very quickly. In addition to the familiar criteria that everyone has long provided, they have begun to show, for example, the crime situation or weather data for the city where you plan to live. Where does all this information come from? In the United States, weather data has been publicly available for the past twenty years. It is the most monetized open data in the world.

Crime information is disclosed by police departments, and several dozen projects based on it have already appeared. Information on the environmental situation is also published; it comes either from state monitoring or from commercial sources. That is why I always tell developers to think not only about what they can build on their own but also about what could be built into the result of their work and how to earn extra money from it.

One way to put your work to use is indirect monetization: selling what you have created. For example, the guys behind the Chicago Crime crime-monitoring project sold it to MSN, which made it part of its portal.

And the British are very proud that after data on the success rates of heart operations in different hospitals was opened up, the number of deaths fell, because people began to choose hospitals based on this information.

A huge number of the startups that arise in the United States around open data are created to enrich various existing ideas with open data.

Open data in Russia


One of the most important things in working with open data is a convenient format, and in Russia this requirement is often not met. In addition, even though we have adopted a law on open data, many government agencies may pay little attention to the information on their websites. For example, they often forget to update it.

Some open data in our country is published by commercial organizations. One example is the Russian National Corpus, which is supported by Yandex. Russian Railways publishes all information on the benefits it provides: we can find out who received how many benefits, as well as information on tariffs and financial statements. You just need to go through corporate websites and see what is published there.

image
Chart based on turnout data from the Moscow mayoral election

The Unified State Exam, for all its shortcomings, has one important plus: the quality of education in schools can be measured. But the data is scattered, so there are no decent projects based on it. Yet one could build a “Pick a School” application or add this information to real estate services.

Another area is housing and utilities. The Moscow authorities have begun to disclose a lot of information about the housing and utilities sector. The portal gorod.mos.ru has information on every building. If you parse the data for all buildings, you can find out how many people complain, how quickly their complaints are answered, and so on. You just need to assemble the database. And although the portal's developers did not set themselves that goal, nothing prevents us from doing it ourselves.

Our country is one of the few where public procurement data is fully disclosed. Processing it is not a simple task, because it is big data. But convenient services can be built on it, for example for suppliers.

Government data in Russia is now scattered across a multitude of portals. Every ministry and every federal agency has its own special section. We have several open data portals: the Moscow portal, the portal of the Ulyanovsk region, and soon there will be portals for the Tula region, Perm Territory and the city of Perm. Information Culture has a portal, hubofdata.ru, where we bulk-load dozens of gigabytes of data, useful and not so useful, with scripts. We have 3000 datasets of statistics alone there, plus data on State Duma deputies' votes, economic registers, all Moscow data and all Ulyanovsk region data.

There is a similar portal, ar.gov.ru, maintained by the Ministry of Economic Development. For now they are simply cataloging everything that exists. Data on the budget of the city of Moscow is openly available on a special portal, budget.mos.ru, which even has a section for developers.

So far, publishing open data is mandatory only for federal authorities, and the process is moving gradually. We have many laws that are not enforced, for example federal law N 8-ФЗ on the openness of information. At best, 10% of government agencies comply with it fully. The rest violate it in some small way, not always deliberately but rather through the negligence of the people who maintain official sites. But the signed Charter and the adopted law on data openness show that working with open data has already become part of state policy. Our peculiarity is that we do not know what information exists at all. For example, there are transcripts of deputies' speeches. As published they are not machine-readable, but we have a machine-readable version.

If you have any ideas and need help, you can write to me - I will always tell you what data you can use for your purposes and where you can get it.

What is Apps4Russia


One important task is to fuel interest in open data. That is why Apps4Russia was created: a long-running contest for developers that we started before the state became interested in the topic. In 2011, seven people put up their own money for the prize pool and held the first contest, which drew about fifteen substantial entries. After it, we founded the non-profit partnership “Information Culture”, and we are now holding the contest for the third time. Its main purpose is to motivate developers to turn to open data and to show them that it can and should be used in their projects.

One great project that took part in Apps4Russia was a social map. This application used a mobile phone's coordinates to determine which government institutions were nearby and immediately showed their phone numbers: the DEZ, the district administration, the police station and so on. This is open data that was collected from different sites and systematized. Recently we held a small contest based on police data, which produced several applications that help you find out who your local police officer is.
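The core of such a "what is near me" application can be sketched in a few lines of Python. This is not how the project itself was implemented; the institution records, phone numbers and coordinates below are placeholders, and the distance is computed with the standard haversine formula.

```python
import math

# Placeholder records (assumptions for this sketch, not real institutions).
INSTITUTIONS = [
    {"name": "Police station", "phone": "000-00-01", "lat": 55.7600, "lon": 37.6200},
    {"name": "Housing office", "phone": "000-00-02", "lat": 55.7500, "lon": 37.6100},
    {"name": "District administration", "phone": "000-00-03", "lat": 55.9000, "lon": 37.6000},
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearby(lat, lon, places, radius_km):
    """Returns places within radius_km of the point, nearest first."""
    with_distance = [(haversine_km(lat, lon, p["lat"], p["lon"]), p) for p in places]
    in_range = [(d, p) for d, p in with_distance if d <= radius_km]
    in_range.sort(key=lambda pair: pair[0])
    return [p for _, p in in_range]


# The phone's coordinates would come from the device; here they are hard-coded.
for place in nearby(55.7558, 37.6173, INSTITUTIONS, radius_km=2.0):
    print(place["name"], place["phone"])
```

The hard part of a project like this is not the distance calculation but collecting and cleaning the list of institutions, which is exactly where open data helps.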

This year Apps4Russia has a Yandex nomination in which applications built on Yandex technologies will compete. The idea behind it is very specific: Yandex is a service company that also works with open data and gives developers many opportunities to improve the quality of their products. It is hard to measure how many projects have made money on Yandex.Maps, but the product quality of a great many has certainly improved. You can use not only Yandex.Maps but also the Yandex.Search API and the APIs of other services.

In addition to conventional APIs, Yandex also has technologies designed specifically for processing free-form language. For example, some time ago the Tomita parser, designed for exactly this, was made open. It is what helps make sense of texts in, for example, Yandex.News.

And with Search and a registry of hospitals you can build a search engine for hospitals. Or create a mobile application for prosecutors, or for people interested in prosecutors' offices, by collecting data from all the prosecutors' sites and adding news from RSS. And then sell it to the prosecutors themselves.

You can take a small piece from each dataset and put it to some use. If a registry of organizations includes their web addresses, you can run a crawler over them, collect the RSS feeds and build a mobile news application for the city of Moscow: every Moscow department has an RSS feed. All of this can be done with Yandex technologies; you just need to go to api.yandex.ru. This year, applications for Apps4Russia close on September 16, but there is a chance that we will extend the deadline.
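The RSS part of that idea can be sketched as follows. This is only an illustration in Python with the standard library: the department names, feed contents and example.org links are invented, and in a real application each feed would be downloaded from the address found in the registry.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Invented RSS 2.0 feeds (assumptions for this sketch); real ones would be fetched over HTTP.
FEEDS = {
    "Department A": """<rss version="2.0"><channel><title>Department A</title>
        <item><title>Road repairs announced</title><link>http://example.org/a/1</link>
              <pubDate>Mon, 09 Sep 2013 10:00:00 +0400</pubDate></item>
        <item><title>New bus route opened</title><link>http://example.org/a/2</link>
              <pubDate>Tue, 10 Sep 2013 09:00:00 +0400</pubDate></item>
    </channel></rss>""",
    "Department B": """<rss version="2.0"><channel><title>Department B</title>
        <item><title>Park renovation finished</title><link>http://example.org/b/1</link>
              <pubDate>Tue, 10 Sep 2013 12:00:00 +0400</pubDate></item>
    </channel></rss>""",
}


def parse_feed(source, xml_text):
    """Extracts title, link and publication date from every <item> in a feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "source": source,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        }
        for item in root.iter("item")
    ]


def merged_news(feeds):
    """Combines items from all feeds into one list, newest first."""
    news = []
    for source, xml_text in feeds.items():
        news.extend(parse_feed(source, xml_text))
    return sorted(news, key=lambda n: n["published"], reverse=True)


for entry in merged_news(FEEDS):
    print(entry["published"].isoformat(), entry["source"], entry["title"])
```

A merged, date-sorted list like this is already most of the data layer a city news application needs; the rest is presentation.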

Yandex open data hackathon


The first Yandex Hackathon will take place in Moscow on September 14-15. For two days and two nights, developers will build applications based on open government data and Yandex technologies. You can take part in teams of up to five people, and you can either come as a ready-made team or form one on the spot.

In a contest you can think for a long time and then build something in two hours; at the Hackathon you have to think fast. As a rule, you need to come prepared. So think in advance about what you will build, work out where you will look for information, and study the APIs. Of course, you will get help on the spot: there will be consultants on open data and on Yandex technologies.

I want to stress once again that you do not have to build a product you will sell right away. You can make it part of another product. You can sell yourself, by showing that you can implement a particular piece of a project well. And not necessarily to an employer: it is simply work on your reputation. At the Hackathon you can show that you can create cool things from some information and some tools.

The main goal of both Apps4Russia and the Yandex Hackathon is to show that there is a great deal of information and technology around with which you can create something useful.