New Master Seminar in the Summer Term: "Data Mining: Automated Collection and Analysis of Textual & Numerical Data" (AIDAHO) [22.03.24]
Starting with the summer term of 2024, the seminar "Data Mining: Automated Collection and Analysis of Textual & Numerical Data" will be offered annually for Master's students of all faculties. This seminar is part of the AIDAHO Certificate Program, but can also be recognized as an elective or portfolio module.
Lecturer: Prof. Dr. Jens Vogelgesang
Hop on the data digging adventure! This hands-on course is all about the nitty-gritty of using data science in real life, exploring the goldmine of data hidden on the internet and digital platforms. We're basically swimming in data that can answer old and brand-new questions. The tricky part? A lot of this data is kind of a hot mess or just barely put together. Take website content, like press releases, as an example. To analyze this, you need to get under the hood of the website’s HTML code to find and extract the information you need. The same goes for PDFs, like annual reports, which might require a bit of optical character recognition (OCR) magic to turn images of text into actual data you can work with. Then there’s the world of semi-structured data, served up through application programming interfaces (APIs). These APIs, like Spotify's Web API for fetching track features, offer a more organized way to access data for analysis. Check out the example code below to get a feel for what we're going to do. It's like a sneak peek into grabbing data from somewhere cool like Spotify's API. If you are more interested in time series data, APIs also provide access to databases on stock market trends or biological processes. Once we've scraped or pulled down the data, we're going to roll up our sleeves and get into statistical text analysis and number-crunching to get to the bottom of our project questions.
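To give a feel for what "semi-structured" means in practice, here is a minimal sketch of turning a JSON response into flat rows you can analyze. It is written in Python purely for illustration (the course itself works in R), and it makes no network request: the payload, the track IDs, and the values are invented, and the field names are only assumptions loosely modeled on track features. The helper names `flatten_tracks` and `column_mean` are hypothetical.

```python
import json
from statistics import mean

# Hypothetical payload: invented values, no real API call is made.
# The field names are assumptions chosen for illustration only.
SAMPLE_RESPONSE = """
{
  "audio_features": [
    {"id": "track_a", "danceability": 0.70, "energy": 0.80, "tempo": 120.0},
    {"id": "track_b", "danceability": 0.50, "energy": 0.40, "tempo": 90.0},
    null
  ]
}
"""


def flatten_tracks(raw_json, fields=("id", "danceability", "energy", "tempo")):
    """Turn the nested JSON text into a list of flat dicts (one per track)."""
    payload = json.loads(raw_json)
    rows = []
    for item in payload.get("audio_features", []):
        if item is None:
            # Semi-structured data often contains gaps; skip empty entries.
            continue
        # Keep only the requested fields; missing ones become None.
        rows.append({field: item.get(field) for field in fields})
    return rows


def column_mean(rows, field):
    """Average one numeric column, ignoring missing values."""
    values = [row[field] for row in rows if row[field] is not None]
    return mean(values) if values else None


rows = flatten_tracks(SAMPLE_RESPONSE)
print(rows)
print(column_mean(rows, "tempo"))
```

The point of the sketch is the shape of the work: parse the response, deal with the gaps, and end up with a table-like structure that statistical tools can handle.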
The following topics will be covered in the course:
- Import, preparation, and visualization of unstructured and semi-structured data in R
- HTML web scraping techniques
- Extraction of text and numbers from PDF files using optical character recognition
- Data pulling from Web APIs
- Statistical text analysis
- Statistical methods for analysing quantitative data
- Preparation and presentation of a scientific poster
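Two of the topics above, HTML web scraping and statistical text analysis, can be sketched in a few lines. The example below is an illustration in Python using only the standard library (the course itself works in R). The HTML snippet is invented, and the class and function names (`ParagraphExtractor`, `extract_paragraphs`, `term_frequencies`) are hypothetical. It pulls the text out of `<p>` elements and counts how often each word occurs.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Invented HTML standing in for a press release page.
SAMPLE_HTML = """
<html><body>
  <h1>Press release</h1>
  <p class="lead">Revenue grew in 2023.</p>
  <p>Revenue growth was driven by new products.</p>
  <script>var ignored = "revenue";</script>
</body></html>
"""


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            self.paragraphs.append("".join(self._buffer).strip())
            self._inside_p = False

    def handle_data(self, data):
        # Text outside <p> (headings, scripts) is ignored.
        if self._inside_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def term_frequencies(texts):
    """Count lowercase alphabetic tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


paragraphs = extract_paragraphs(SAMPLE_HTML)
frequencies = term_frequencies(paragraphs)
print(paragraphs)
print(frequencies["revenue"])
```

Real pages are messier than this snippet, so treat it as a starting point: the tokenizer here drops numbers and non-ASCII letters, and a real project would usually remove stop words before counting.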
ILIAS-Access: https://ilias.uni-hohenheim.de/goto.php?target=crs_1572531&client_id=UHOH