Predicting the future development of a natural phenomenon is one of the main goals of science, but it remains a great challenge, especially when the phenomenon involves people whose reactions can feed back on the observed quantities. This is particularly true of epidemics, and especially of the COVID-19 outbreak that the world is currently experiencing. We propose a novel data-driven framework for assessing the a-priori epidemic risk of a geographical area and for identifying high-risk areas within a country. Our risk index is evaluated as a function of three different components: the hazard of the disease, the exposure of the area and the vulnerability of its inhabitants. As an application, we discuss the case of the COVID-19 outbreak in Italy. We characterize each of the twenty Italian regions using available historical data on air pollution, human mobility, winter temperature, housing concentration, health care density, population size and age. We find that the epidemic risk is higher in some of the Northern regions than in Central and Southern Italy. The corresponding risk index shows correlations with the available official data on the number of infected individuals, patients in intensive care and deceased patients, and can help explain why regions such as Lombardia in particular, but also Emilia-Romagna, Piemonte and Veneto, have suffered much more than the rest of the country. Although the COVID-19 outbreak started in both Northern (Lombardia and Veneto) and Central Italy (Lazio) at almost the same time, when the first cases were officially certified at the beginning of 2020, the disease has spread faster and with heavier consequences in regions with higher epidemic risk. Our framework can be extended and tested on other epidemic data, such as those on seasonal flu, and applied to other countries.
We also present a policy model connected with our methodology, which can help policy-makers make informed decisions.