Heterogeneous Data Integration

Keywords: Data Integration, Data Fusion, Data Heterogeneity

Abstract:

Recent years have witnessed technological evolution in various fields such as Information and Communication Technology (ICT), sensors and sensor networks, cloud infrastructures, data management techniques, and data storage solutions. This has enabled industries, environments, and businesses to become increasingly data-oriented systems that generate and consume huge amounts of data. As a result, the amount of information generated in the world increases by 30% every year [1]. With more data contributors (i.e., various data sources and documents), the need for collaboration has emerged: data management applications now need to incorporate data from various sources and provide a uniform interface through which users can retrieve and then exploit data from multiple sources. For these reasons, data integration and fusion have become a topic of great interest in the scientific community [1,2]. To enable the applications that users require (e.g., event detection, statistical analysis, forecasting), heterogeneous data must first be integrated and then refined in order to avoid inconsistencies such as mismatches, anomalies, and duplicates.

Mission – Main Activities:

Data integration needs to address two major challenges: (i) handling the heterogeneous data generated by differing sources in terms of schema, domain-specific constraints, and varying representations; and (ii) handling data inconsistencies such as conflicts, incompleteness, duplications, and anomalies [1]. The thesis therefore focuses on contributions that improve the data integration stack, from the schema level (e.g., schema mapping to semantically link heterogeneous data schemas), through data deduplication, to data fusion, which handles data incompleteness and conflicts/mismatches. The goal is to provide data consumer applications with consolidated, high-quality integrated data from heterogeneous sources.
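For orientation only, the three stages named above (schema mapping, deduplication, fusion) can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the source names, the field names, the normalized-key deduplication, and the majority-vote fusion rule are hypothetical and are not the methods the thesis will develop.

```python
from collections import Counter, defaultdict

# Hypothetical per-source schema mappings: source field -> unified field.
SCHEMA_MAPPINGS = {
    "sensor_a": {"id": "station_id", "temp_c": "temperature", "loc": "city"},
    "sensor_b": {"station": "station_id", "temperature": "temperature", "town": "city"},
}


def map_schema(source, record):
    """Rename a record's fields to the unified schema; drop unmapped fields."""
    mapping = SCHEMA_MAPPINGS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def dedup_key(record):
    """Normalize the identifier so records about one entity share a key."""
    return str(record["station_id"]).strip().lower()


def fuse(records):
    """Resolve conflicts per attribute by majority vote over non-missing values."""
    fused = {}
    attributes = {k for r in records for k in r}
    for attr in sorted(attributes):
        values = [r[attr] for r in records if r.get(attr) is not None]
        fused[attr] = Counter(values).most_common(1)[0][0] if values else None
    return fused


def integrate(sources):
    """Run schema mapping, then deduplication (grouping), then fusion."""
    groups = defaultdict(list)
    for source, records in sources.items():
        for record in records:
            unified = map_schema(source, record)
            key = dedup_key(unified)
            unified["station_id"] = key  # store the normalized identifier
            groups[key].append(unified)
    return {key: fuse(group) for key, group in groups.items()}


if __name__ == "__main__":
    example_sources = {
        "sensor_a": [{"id": "ST-1", "temp_c": 21.5, "loc": "Anglet"}],
        "sensor_b": [{"station": " st-1", "temperature": 21.5, "town": None}],
    }
    print(integrate(example_sources))
```

Real integration pipelines typically replace each stage with something far richer (semantic schema matching, similarity-based duplicate detection, source-aware conflict resolution); the sketch only shows how the stages feed into one another.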

Applicant’s Profile:

  • The ideal candidate has a master's degree in computer science
  • Previous experience in data management techniques would be a plus
  • We expect outstanding analytical competence, a strong interest in interdisciplinary research (machine learning/information retrieval), experience in software engineering (strong programming skills in Python), as well as superior organization and communication skills
  • The candidate must have a good level of English and the ability to work autonomously

Candidate Application:

  • Application file assessment: Selection committee
  • Candidates will first be selected based on their application file.
  • Those selected after this first step will then be interviewed.
  • Application files will be evaluated based on the following criteria:
    • Grades and ranking during your Master's degree, steadiness in your academic background
    • English language proficiency
    • Candidate’s ability to present her/his work and results
    • Work experience such as a laboratory internship or equivalent; previous research work (reports, publications)
  • CV
  • Cover letter
  • Master's degree grade transcripts and ranking
  • Reference letter
    • Contact details of at least two people from your work environment who can be contacted for further reference
  • The application must be sent to the following email address with the title “Doctoral application”: [email protected]
  • Application deadline: SEPTEMBER 15, 2021

Logistics:

  • Hosting laboratory: LIUPPA
  • Location: IUT de Bayonne – 2 Allée du Parc de Montaury 64600 Anglet
  • Laboratory expertise: Computer Science
  • Funding: E2S UPPA project of the University of Pau et des Pays de l’Adour (UPPA)
  • Thesis Director: Richard Chbeir
  • Starting Date: November 1st, 2021
  • Duration: 3 years
  • Gross salary: 1 870 € / month (including additional pay for teaching duties, 32 h per year)

References:

  1. Dong, Xin Luna, and Felix Naumann. “Data fusion: resolving data conflicts for integration.” Proceedings of the VLDB Endowment 2.2 (2009): 1654-1655.
  2. Dong, Xin Luna, Laure Berti-Equille, and Divesh Srivastava. “Data fusion: resolving conflicts from multiple sources.” Handbook of Data Quality. Springer, Berlin, Heidelberg, 2013. 293-318.