{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "dd88f680", "metadata": {}, "source": [ "# Audio Data" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9e4390a0", "metadata": {}, "source": [ "## Introduction\n", "\n", "Audio data - as recorded by smartphones or other portable devices - can carry important information about individuals' environments. It may offer insights into activity, sleep, and social interaction. However, using these data can be tricky due to privacy concerns; conversations, for example, are highly identifiable. A possible solution is to compute more general characteristics (e.g. frequency) and extract features from those instead. To address this last part, `niimpy` includes the function `extract_features_audio` to clean, downsample, and extract features from audio snippets that have already been anonymized.\n", "\n", "Audio dataframes should have the following columns (column names can differ, but in that case they must be provided as parameters):\n", "- `user`: Subject ID\n", "- `device`: Device ID\n", "- `is_silent`: Boolean value, indicates when audio is too quiet to record\n", "- `frequency`: Audio frequency in Hz\n", "- `decibels`: Audio volume in decibels\n", "\n", "Niimpy extracts the following audio features:\n", "- `audio_count_silent`: number of times when the environment has been silent (i.e. `is_silent` is true)\n", "- `audio_count_speech`: number of times when there has been some sound in the environment within the human speech frequency range (65 - 255 Hz)\n", "- `audio_count_loud`: number of times when there has been some sound in the environment above 70 dB\n", "- `audio_min_freq`: minimum frequency of the recorded audio snippets\n", "- `audio_max_freq`: maximum frequency of the recorded audio snippets\n", "- `audio_mean_freq`: mean frequency of the recorded audio snippets\n", "- `audio_median_freq`: median frequency of the recorded audio snippets\n", "- `audio_std_freq`: standard deviation of the frequency of the recorded audio snippets\n", "- `audio_min_db`: minimum decibels of the recorded audio snippets\n", "- `audio_max_db`: maximum decibels of the recorded audio snippets\n", "- `audio_mean_db`: mean decibels of the recorded audio snippets\n", "- `audio_median_db`: median decibels of the recorded audio snippets\n", "- `audio_std_db`: standard deviation of the decibels of the recorded audio snippets\n", "\n", "In the following, we will analyze audio snippets provided by `niimpy` as an example to illustrate the use of niimpy's audio preprocessing functions." ] }, { "attachments": {}, "cell_type": "markdown", "id": "1937680b", "metadata": {}, "source": [ "## 2. Read data\n", "\n", "Let's start by reading the example data provided in `niimpy`. These data have already been shaped into a format that meets the requirements of the data schema. First, we import the needed modules: the `niimpy` package itself and the module we will use (`audio`), which we give a short alias for convenience. " ] }, { "cell_type": "code", "execution_count": 1, "id": "8e00f1bf", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/u/24/rantahj1/unix/miniconda3/envs/niimpy/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "import niimpy\n", "from niimpy import config\n", "import niimpy.preprocessing.audio as au\n", "import pandas as pd\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "00507e0e", "metadata": {}, "source": [ "Now let's read the example data provided in `niimpy`. The example data is in `csv` format, so we need to use the `read_csv` function. When reading the data, we can specify the timezone where the data was collected. 
This will help us handle daylight saving time more easily. We can specify the timezone with the argument **tz**. The output is a dataframe. We can also check the number of rows and columns in the dataframe." ] }, { "cell_type": "code", "execution_count": 2, "id": "aa7d80df", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(33, 7)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = niimpy.read_csv(config.MULTIUSER_AWARE_AUDIO_PATH, tz='Europe/Helsinki')\n", "data.shape" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3fb22de0", "metadata": {}, "source": [ "The data was successfully read. We can see that there are 33 datapoints with 7 columns in the dataset. However, we do not know yet what the data really looks like, so let's have a quick look:" ] }, { "cell_type": "code", "execution_count": 3, "id": "e416e790", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
userdevicetimeis_silentdouble_decibelsdouble_frequencydatetime
2020-01-09 02:08:03.895999908+02:00jd9INuQ5BBlW3p83yASkOb_B1.578528e+0908449352020-01-09 02:08:03.895999908+02:00
2020-01-09 02:38:03.895999908+02:00jd9INuQ5BBlW3p83yASkOb_B1.578530e+0908987342020-01-09 02:38:03.895999908+02:00
2020-01-09 03:08:03.895999908+02:00jd9INuQ5BBlW3p83yASkOb_B1.578532e+0909917102020-01-09 03:08:03.895999908+02:00
2020-01-09 03:38:03.895999908+02:00jd9INuQ5BBlW3p83yASkOb_B1.578534e+0907790542020-01-09 03:38:03.895999908+02:00
2020-01-09 04:08:03.895999908+02:00jd9INuQ5BBlW3p83yASkOb_B1.578536e+09080122652020-01-09 04:08:03.895999908+02:00
\n", "
" ], "text/plain": [ " user device time \\\n", "2020-01-09 02:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n", "2020-01-09 02:38:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n", "2020-01-09 03:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 \n", "2020-01-09 03:38:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578534e+09 \n", "2020-01-09 04:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578536e+09 \n", "\n", " is_silent double_decibels \\\n", "2020-01-09 02:08:03.895999908+02:00 0 84 \n", "2020-01-09 02:38:03.895999908+02:00 0 89 \n", "2020-01-09 03:08:03.895999908+02:00 0 99 \n", "2020-01-09 03:38:03.895999908+02:00 0 77 \n", "2020-01-09 04:08:03.895999908+02:00 0 80 \n", "\n", " double_frequency \\\n", "2020-01-09 02:08:03.895999908+02:00 4935 \n", "2020-01-09 02:38:03.895999908+02:00 8734 \n", "2020-01-09 03:08:03.895999908+02:00 1710 \n", "2020-01-09 03:38:03.895999908+02:00 9054 \n", "2020-01-09 04:08:03.895999908+02:00 12265 \n", "\n", " datetime \n", "2020-01-09 02:08:03.895999908+02:00 2020-01-09 02:08:03.895999908+02:00 \n", "2020-01-09 02:38:03.895999908+02:00 2020-01-09 02:38:03.895999908+02:00 \n", "2020-01-09 03:08:03.895999908+02:00 2020-01-09 03:08:03.895999908+02:00 \n", "2020-01-09 03:38:03.895999908+02:00 2020-01-09 03:38:03.895999908+02:00 \n", "2020-01-09 04:08:03.895999908+02:00 2020-01-09 04:08:03.895999908+02:00 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "id": "260eccd7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
userdevicetimeis_silentdouble_decibelsdouble_frequencydatetime
2019-08-13 15:02:17.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565698e+0914429142019-08-13 15:02:17.657999992+03:00
2019-08-13 15:28:59.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565699e+0914971952019-08-13 15:28:59.657999992+03:00
2019-08-13 15:59:01.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565701e+09055912019-08-13 15:59:01.657999992+03:00
2019-08-13 16:29:03.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565703e+0907638532019-08-13 16:29:03.657999992+03:00
2019-08-13 16:59:05.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565705e+0908474192019-08-13 16:59:05.657999992+03:00
\n", "
" ], "text/plain": [ " user device time \\\n", "2019-08-13 15:02:17.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565698e+09 \n", "2019-08-13 15:28:59.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565699e+09 \n", "2019-08-13 15:59:01.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565701e+09 \n", "2019-08-13 16:29:03.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565703e+09 \n", "2019-08-13 16:59:05.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565705e+09 \n", "\n", " is_silent double_decibels \\\n", "2019-08-13 15:02:17.657999992+03:00 1 44 \n", "2019-08-13 15:28:59.657999992+03:00 1 49 \n", "2019-08-13 15:59:01.657999992+03:00 0 55 \n", "2019-08-13 16:29:03.657999992+03:00 0 76 \n", "2019-08-13 16:59:05.657999992+03:00 0 84 \n", "\n", " double_frequency \\\n", "2019-08-13 15:02:17.657999992+03:00 2914 \n", "2019-08-13 15:28:59.657999992+03:00 7195 \n", "2019-08-13 15:59:01.657999992+03:00 91 \n", "2019-08-13 16:29:03.657999992+03:00 3853 \n", "2019-08-13 16:59:05.657999992+03:00 7419 \n", "\n", " datetime \n", "2019-08-13 15:02:17.657999992+03:00 2019-08-13 15:02:17.657999992+03:00 \n", "2019-08-13 15:28:59.657999992+03:00 2019-08-13 15:28:59.657999992+03:00 \n", "2019-08-13 15:59:01.657999992+03:00 2019-08-13 15:59:01.657999992+03:00 \n", "2019-08-13 16:29:03.657999992+03:00 2019-08-13 16:29:03.657999992+03:00 \n", "2019-08-13 16:59:05.657999992+03:00 2019-08-13 16:59:05.657999992+03:00 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.tail()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0956889d", "metadata": {}, "source": [ "By exploring the head and tail of the dataframe we can form an idea of its entirety. From the data, we can see that:\n", "\n", "- rows are observations, indexed by timestamps, i.e. 
each row represents a snippet that has been recorded at a given time and date\n", "- columns are characteristics for each observation, for example, the user whose data we are analyzing\n", "- there are at least two different users in the dataframe\n", "- there are two main measurement columns: `double_decibels` and `double_frequency`.\n", "\n", "To confirm, we can check the first three rows for each user:" ] }, { "cell_type": "code", "execution_count": 5, "id": "aa599198", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
userdevicetimeis_silentdouble_decibelsdouble_frequencydatetime
2020-01-09 02:08:03.895999908+02:00jd9INuQ5BBlW3p83yASkOb_B1.578528e+0908449352020-01-09 02:08:03.895999908+02:00
2020-01-09 02:38:03.895999908+02:00jd9INuQ5BBlW3p83yASkOb_B1.578530e+0908987342020-01-09 02:38:03.895999908+02:00
2020-01-09 03:08:03.895999908+02:00jd9INuQ5BBlW3p83yASkOb_B1.578532e+0909917102020-01-09 03:08:03.895999908+02:00
2019-08-13 07:28:27.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565671e+0905177352019-08-13 07:28:27.657999992+03:00
2019-08-13 07:58:29.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565672e+09090136092019-08-13 07:58:29.657999992+03:00
2019-08-13 08:28:31.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565674e+0908176902019-08-13 08:28:31.657999992+03:00
\n", "
" ], "text/plain": [ " user device time \\\n", "2020-01-09 02:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n", "2020-01-09 02:38:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n", "2020-01-09 03:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 \n", "2019-08-13 07:28:27.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565671e+09 \n", "2019-08-13 07:58:29.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565672e+09 \n", "2019-08-13 08:28:31.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565674e+09 \n", "\n", " is_silent double_decibels \\\n", "2020-01-09 02:08:03.895999908+02:00 0 84 \n", "2020-01-09 02:38:03.895999908+02:00 0 89 \n", "2020-01-09 03:08:03.895999908+02:00 0 99 \n", "2019-08-13 07:28:27.657999992+03:00 0 51 \n", "2019-08-13 07:58:29.657999992+03:00 0 90 \n", "2019-08-13 08:28:31.657999992+03:00 0 81 \n", "\n", " double_frequency \\\n", "2020-01-09 02:08:03.895999908+02:00 4935 \n", "2020-01-09 02:38:03.895999908+02:00 8734 \n", "2020-01-09 03:08:03.895999908+02:00 1710 \n", "2019-08-13 07:28:27.657999992+03:00 7735 \n", "2019-08-13 07:58:29.657999992+03:00 13609 \n", "2019-08-13 08:28:31.657999992+03:00 7690 \n", "\n", " datetime \n", "2020-01-09 02:08:03.895999908+02:00 2020-01-09 02:08:03.895999908+02:00 \n", "2020-01-09 02:38:03.895999908+02:00 2020-01-09 02:38:03.895999908+02:00 \n", "2020-01-09 03:08:03.895999908+02:00 2020-01-09 03:08:03.895999908+02:00 \n", "2019-08-13 07:28:27.657999992+03:00 2019-08-13 07:28:27.657999992+03:00 \n", "2019-08-13 07:58:29.657999992+03:00 2019-08-13 07:58:29.657999992+03:00 \n", "2019-08-13 08:28:31.657999992+03:00 2019-08-13 08:28:31.657999992+03:00 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop_duplicates(['user','time']).groupby('user').head(3)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "beac76e5", "metadata": {}, "source": [ "Sometimes the data may come in a disordered manner, so just to make sure, let's order 
the dataframe and compare the results. We will sort by the columns \"user\" and \"datetime\", since we want the information ordered first by participant and then chronologically. Luckily, in our dataframe, the index and datetime are the same." ] }, { "cell_type": "code", "execution_count": 6, "id": "560cd6ad", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
userdevicetimeis_silentdouble_decibelsdouble_frequencydatetime
2019-08-13 07:28:27.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565671e+0905177352019-08-13 07:28:27.657999992+03:00
2019-08-13 07:58:29.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565672e+09090136092019-08-13 07:58:29.657999992+03:00
2019-08-13 08:28:31.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565674e+0908176902019-08-13 08:28:31.657999992+03:00
2020-01-09 02:08:03.895999908+02:00jd9INuQ5BBlW3p83yASkOb_B1.578528e+0908449352020-01-09 02:08:03.895999908+02:00
2020-01-09 02:38:03.895999908+02:00jd9INuQ5BBlW3p83yASkOb_B1.578530e+0908987342020-01-09 02:38:03.895999908+02:00
2020-01-09 03:08:03.895999908+02:00jd9INuQ5BBlW3p83yASkOb_B1.578532e+0909917102020-01-09 03:08:03.895999908+02:00
\n", "
" ], "text/plain": [ " user device time \\\n", "2019-08-13 07:28:27.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565671e+09 \n", "2019-08-13 07:58:29.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565672e+09 \n", "2019-08-13 08:28:31.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565674e+09 \n", "2020-01-09 02:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n", "2020-01-09 02:38:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n", "2020-01-09 03:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 \n", "\n", " is_silent double_decibels \\\n", "2019-08-13 07:28:27.657999992+03:00 0 51 \n", "2019-08-13 07:58:29.657999992+03:00 0 90 \n", "2019-08-13 08:28:31.657999992+03:00 0 81 \n", "2020-01-09 02:08:03.895999908+02:00 0 84 \n", "2020-01-09 02:38:03.895999908+02:00 0 89 \n", "2020-01-09 03:08:03.895999908+02:00 0 99 \n", "\n", " double_frequency \\\n", "2019-08-13 07:28:27.657999992+03:00 7735 \n", "2019-08-13 07:58:29.657999992+03:00 13609 \n", "2019-08-13 08:28:31.657999992+03:00 7690 \n", "2020-01-09 02:08:03.895999908+02:00 4935 \n", "2020-01-09 02:38:03.895999908+02:00 8734 \n", "2020-01-09 03:08:03.895999908+02:00 1710 \n", "\n", " datetime \n", "2019-08-13 07:28:27.657999992+03:00 2019-08-13 07:28:27.657999992+03:00 \n", "2019-08-13 07:58:29.657999992+03:00 2019-08-13 07:58:29.657999992+03:00 \n", "2019-08-13 08:28:31.657999992+03:00 2019-08-13 08:28:31.657999992+03:00 \n", "2020-01-09 02:08:03.895999908+02:00 2020-01-09 02:08:03.895999908+02:00 \n", "2020-01-09 02:38:03.895999908+02:00 2020-01-09 02:38:03.895999908+02:00 \n", "2020-01-09 03:08:03.895999908+02:00 2020-01-09 03:08:03.895999908+02:00 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.sort_values(by=['user', 'datetime'], inplace=True)\n", "data.drop_duplicates(['user','time']).groupby('user').head(3)" ] }, { "cell_type": "markdown", "id": "b4988507", "metadata": {}, "source": [ "The main column names in our dataframe do not 
match the Niimpy schema. We could provide these column names as parameters, but it is easier to rename them, which we will do after reviewing the data format requirements below. For now, let's take another look at the data:" ] }, { "cell_type": "code", "execution_count": 7, "id": "5f17abe7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
userdevicetimeis_silentdouble_decibelsdouble_frequencydatetime
2019-08-13 07:28:27.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565671e+0905177352019-08-13 07:28:27.657999992+03:00
2019-08-13 07:58:29.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565672e+09090136092019-08-13 07:58:29.657999992+03:00
2019-08-13 08:28:31.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565674e+0908176902019-08-13 08:28:31.657999992+03:00
2019-08-13 08:58:33.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565676e+0905883472019-08-13 08:58:33.657999992+03:00
2019-08-13 09:28:35.657999992+03:00iGyXetHE3S8uCq9vueHh3zVs1.565678e+09136135922019-08-13 09:28:35.657999992+03:00
\n", "
" ], "text/plain": [ " user device time \\\n", "2019-08-13 07:28:27.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565671e+09 \n", "2019-08-13 07:58:29.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565672e+09 \n", "2019-08-13 08:28:31.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565674e+09 \n", "2019-08-13 08:58:33.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565676e+09 \n", "2019-08-13 09:28:35.657999992+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565678e+09 \n", "\n", " is_silent double_decibels \\\n", "2019-08-13 07:28:27.657999992+03:00 0 51 \n", "2019-08-13 07:58:29.657999992+03:00 0 90 \n", "2019-08-13 08:28:31.657999992+03:00 0 81 \n", "2019-08-13 08:58:33.657999992+03:00 0 58 \n", "2019-08-13 09:28:35.657999992+03:00 1 36 \n", "\n", " double_frequency \\\n", "2019-08-13 07:28:27.657999992+03:00 7735 \n", "2019-08-13 07:58:29.657999992+03:00 13609 \n", "2019-08-13 08:28:31.657999992+03:00 7690 \n", "2019-08-13 08:58:33.657999992+03:00 8347 \n", "2019-08-13 09:28:35.657999992+03:00 13592 \n", "\n", " datetime \n", "2019-08-13 07:28:27.657999992+03:00 2019-08-13 07:28:27.657999992+03:00 \n", "2019-08-13 07:58:29.657999992+03:00 2019-08-13 07:58:29.657999992+03:00 \n", "2019-08-13 08:28:31.657999992+03:00 2019-08-13 08:28:31.657999992+03:00 \n", "2019-08-13 08:58:33.657999992+03:00 2019-08-13 08:58:33.657999992+03:00 \n", "2019-08-13 09:28:35.657999992+03:00 2019-08-13 09:28:35.657999992+03:00 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d72f467c", "metadata": {}, "source": [ "Ok, it seems like our dataframe was in order. We can start extracting features. However, we need to understand the data format requirements first.\n", "\n", "## * TIP! Data format requirements (or what should our data look like)\n", "\n", "Data can take other shapes and formats. 
However, the `niimpy` data schema requires it to be in a certain shape. This means the dataframe needs to have at least the following characteristics:\n", "1. One row per audio snippet. Each row should store information about one snippet only\n", "2. Each row's index should be a timestamp\n", "3. The following columns are required: \n", " - index: date and time when the snippet was recorded (timestamp)\n", " - user: stores the user name whose data is analyzed. Each user should have a unique name or hash (i.e. one hash for each unique user)\n", " - is_silent: stores whether the decibel level is below a set threshold (usually 50 dB), i.e. whether the snippet is considered silent\n", " - decibels: stores the decibels of the recorded snippet\n", " - frequency: the frequency of the recorded snippet in Hz\n", " - NOTE: most of our audio examples come from data recorded with the Aware Framework; if you want to know more about the frequency and decibels, please read https://github.com/denzilferreira/com.aware.plugin.ambient_noise\n", "4. Additional columns are allowed.\n", "5. The names of the columns do not need to be exactly \"user\", \"is_silent\", \"decibels\" or \"frequency\", as we can pass our own names in an argument.\n", "\n" ] }, { "cell_type": "markdown", "id": "b8a7a20d", "metadata": {}, "source": [ "Column names in our data do not match the Niimpy schema. We could provide these column names as parameters to niimpy functions, but it is simpler to rename them here." ] }, { "cell_type": "code", "execution_count": 8, "id": "9436998e", "metadata": {}, "outputs": [], "source": [ "data = data.rename(columns={'double_decibels': 'decibels', 'double_frequency': 'frequency'})" ] }, { "cell_type": "markdown", "id": "f2e6e2d6", "metadata": {}, "source": [ "Below is an example of a dataframe that complies with these minimum requirements:" ] }, { "cell_type": "code", "execution_count": 9, "id": "8c66c6b3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
useris_silentdecibelsfrequency
2019-08-13 07:28:27.657999992+03:00iGyXetHE3S8u0517735
2019-08-13 07:58:29.657999992+03:00iGyXetHE3S8u09013609
2019-08-13 08:28:31.657999992+03:00iGyXetHE3S8u0817690
\n", "
" ], "text/plain": [ " user is_silent decibels \\\n", "2019-08-13 07:28:27.657999992+03:00 iGyXetHE3S8u 0 51 \n", "2019-08-13 07:58:29.657999992+03:00 iGyXetHE3S8u 0 90 \n", "2019-08-13 08:28:31.657999992+03:00 iGyXetHE3S8u 0 81 \n", "\n", " frequency \n", "2019-08-13 07:28:27.657999992+03:00 7735 \n", "2019-08-13 07:58:29.657999992+03:00 13609 \n", "2019-08-13 08:28:31.657999992+03:00 7690 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_dataschema = data[['user','is_silent','decibels','frequency']]\n", "example_dataschema.head(3)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7508a0bf", "metadata": {}, "source": [ "## 4. Extracting features\n", "There are two ways to extract features. We can use each function separately, or we can use `niimpy`'s ready-made wrapper. Both ways require us to specify arguments to customize the way the functions work. These arguments are specified in dictionaries. Let's first understand how to extract features using stand-alone functions.\n", "\n", "### 4.1 Extract features using stand-alone functions\n", "We can use `niimpy`'s functions to compute audio features. Each function requires two inputs:\n", "- (mandatory) a dataframe that complies with the minimum requirements (see the '* TIP! Data format requirements' section above)\n", "- (optional) arguments for the stand-alone function\n", "\n", "#### 4.1.1 The argument dictionary for stand-alone functions (or how we specify the way a function works)\n", "We can input two types of arguments to customize the way a stand-alone function works:\n", "- the name of the column to be preprocessed: Since the dataframe may have different columns, we need to specify which column has the data we would like to preprocess. To do so, we can simply pass the name of the column to the argument `audio_column_name`. 
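To make this concrete, here is a rough, hypothetical pure-`pandas` sketch of how such a function can use the column name it is given (an illustration with assumed names and thresholds, not `niimpy`'s actual implementation):

```python
import pandas as pd

# Hypothetical sketch (NOT niimpy's actual code): count snippets louder than
# 70 dB in each time bin, reading the volume from a caller-specified column.
def count_loud(df, audio_column_name="decibels", resample_args=None):
    resample_args = resample_args or {"rule": "30min"}
    loud = (df[audio_column_name] > 70).astype(int)   # 1 if loud, else 0
    return loud.resample(**resample_args).sum()       # loud snippets per bin

# Toy data: four snippets, 30 minutes apart
index = pd.date_range("2020-01-09 02:00", periods=4, freq="30min")
toy = pd.DataFrame({"double_decibels": [84, 89, 55, 77]}, index=index)

counts = count_loud(toy, audio_column_name="double_decibels",
                    resample_args={"rule": "1h"})
print(counts.tolist())  # [2, 1]
```

Conceptually, the stand-alone functions work the same way: they read the data from whichever column you name and aggregate it over time bins.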
\n", "\n", "- the way we resample: resampling options are specified in `niimpy` as a dictionary. `niimpy`'s resampling and aggregating relies on `pandas.DataFrame.resample`, so mastering the use of this pandas function will help us greatly in `niimpy`'s preprocessing. Please familiarize yourself with the pandas resample function before continuing. \n", " Briefly, to use the `pandas.DataFrame.resample` function, we need a rule. This rule states the intervals we would like to use to resample our data (e.g., 15 seconds, 30 minutes, 1 hour). Nevertheless, we can pass more details to the function to specify the exact sampling we would like. For example, we could use the *closed* argument if we would like to specify which side of each interval is closed, or we could use the *offset* argument if we would like to start our binning with an offset, etc. There are plenty of options for this function, so we strongly recommend keeping the `pandas.DataFrame.resample` documentation at hand. All arguments for `pandas.DataFrame.resample` are specified in a dictionary whose keys are the argument names of `pandas.DataFrame.resample` and whose values are the values chosen for each of these arguments. This dictionary is passed to `niimpy` functions as the argument `resample_args`.\n", "\n", "Let's see some examples of these parameters:" ] }, { "cell_type": "markdown", "id": "be26e793", "metadata": {}, "source": [ "```python\n", "au.audio_count_loud(data, audio_column_name = \"frequency\", resample_args = {\"rule\":\"1D\"})\n", "au.audio_count_loud(data, audio_column_name = \"random_name\", resample_args = {\"rule\":\"30min\"})\n", "au.audio_count_loud(data, audio_column_name = \"other_name\", resample_args = {\"rule\":\"45T\",\"origin\":\"end\"})\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "id": "393cd2dd", "metadata": {}, "source": [ "Here, we have three basic examples. 
\n", "\n", "- The first example will analyze the data stored in the column `frequency` in our dataframe. The data will be binned into one-day periods\n", "- The second example will analyze the data stored in the column `random_name` in our dataframe. The data will be aggregated into 30-minute bins\n", "- The third example will analyze the data stored in the column `other_name` in our dataframe. The data will be binned into 45-minute bins, but the binning will start from the last timestamp in the dataframe. \n", "\n", "**Default values:** if no arguments are passed, `niimpy` will aggregate the data into 30-minute bins and select the `audio_column_name` according to the most suitable column. For example, if we are computing the minimum frequency, `niimpy` will select *frequency* as the column name. " ] }, { "cell_type": "markdown", "id": "1d64934a", "metadata": {}, "source": [ "#### 4.1.2 Using the functions\n", "Now that we understand how the functions are customized, it is time to compute our first audio feature. Suppose that we are interested in the number of times our recordings were loud in every 50-minute interval. We will need `niimpy`'s `audio_count_loud` function." ] }, { "cell_type": "code", "execution_count": 10, "id": "98a0af37", "metadata": {}, "outputs": [], "source": [ "my_loud_times = au.audio_count_loud(\n", " data,\n", " audio_column_name = \"decibels\",\n", " resample_args = {\"rule\":\"50T\"}\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3f6a607d", "metadata": {}, "source": [ "Let's look at some values for one of the subjects." ] }, { "cell_type": "code", "execution_count": 11, "id": "ae8260cb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
useraudio_count_louddevice
2020-01-09 01:40:00+02:00jd9INuQ5BBlW13p83yASkOb_B
2020-01-09 02:30:00+02:00jd9INuQ5BBlW23p83yASkOb_B
2020-01-09 03:20:00+02:00jd9INuQ5BBlW23p83yASkOb_B
2020-01-09 04:10:00+02:00jd9INuQ5BBlW03p83yASkOb_B
2020-01-09 05:00:00+02:00jd9INuQ5BBlW13p83yASkOb_B
2020-01-09 05:50:00+02:00jd9INuQ5BBlW13p83yASkOb_B
2020-01-09 06:40:00+02:00jd9INuQ5BBlW1OWd1Uau8POix
2020-01-09 07:30:00+02:00jd9INuQ5BBlW0OWd1Uau8POix
2020-01-09 08:20:00+02:00jd9INuQ5BBlW1OWd1Uau8POix
2020-01-09 09:10:00+02:00jd9INuQ5BBlW1OWd1Uau8POix
2020-01-09 10:00:00+02:00jd9INuQ5BBlW2OWd1Uau8POix
\n", "
" ], "text/plain": [ " user audio_count_loud device\n", "2020-01-09 01:40:00+02:00 jd9INuQ5BBlW 1 3p83yASkOb_B\n", "2020-01-09 02:30:00+02:00 jd9INuQ5BBlW 2 3p83yASkOb_B\n", "2020-01-09 03:20:00+02:00 jd9INuQ5BBlW 2 3p83yASkOb_B\n", "2020-01-09 04:10:00+02:00 jd9INuQ5BBlW 0 3p83yASkOb_B\n", "2020-01-09 05:00:00+02:00 jd9INuQ5BBlW 1 3p83yASkOb_B\n", "2020-01-09 05:50:00+02:00 jd9INuQ5BBlW 1 3p83yASkOb_B\n", "2020-01-09 06:40:00+02:00 jd9INuQ5BBlW 1 OWd1Uau8POix\n", "2020-01-09 07:30:00+02:00 jd9INuQ5BBlW 0 OWd1Uau8POix\n", "2020-01-09 08:20:00+02:00 jd9INuQ5BBlW 1 OWd1Uau8POix\n", "2020-01-09 09:10:00+02:00 jd9INuQ5BBlW 1 OWd1Uau8POix\n", "2020-01-09 10:00:00+02:00 jd9INuQ5BBlW 2 OWd1Uau8POix" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_loud_times[my_loud_times[\"user\"]==\"jd9INuQ5BBlW\"]" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6ffd53b7", "metadata": {}, "source": [ "Let's recall what the original data looks like for this subject:" ] }, { "cell_type": "code", "execution_count": 12, "id": "e085424f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user device time \\\n", "2020-01-09 02:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 \n", "2020-01-09 02:38:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n", "2020-01-09 03:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 \n", "2020-01-09 03:38:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578534e+09 \n", "2020-01-09 04:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578536e+09 \n", "2020-01-09 04:38:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578537e+09 \n", "2020-01-09 05:08:03.895999908+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578539e+09 \n", "\n", " is_silent decibels frequency \\\n", "2020-01-09 02:08:03.895999908+02:00 0 84 4935 \n", "2020-01-09 02:38:03.895999908+02:00 0 89 8734 \n", "2020-01-09 03:08:03.895999908+02:00 0 99 1710 \n", "2020-01-09 03:38:03.895999908+02:00 0 77 9054 \n", "2020-01-09 04:08:03.895999908+02:00 0 80 12265 \n", "2020-01-09 04:38:03.895999908+02:00 0 52 7281 \n", "2020-01-09 05:08:03.895999908+02:00 0 63 14408 \n", "\n", " datetime \n", "2020-01-09 02:08:03.895999908+02:00 2020-01-09 02:08:03.895999908+02:00 \n", "2020-01-09 02:38:03.895999908+02:00 2020-01-09 02:38:03.895999908+02:00 \n", "2020-01-09 03:08:03.895999908+02:00 2020-01-09 03:08:03.895999908+02:00 \n", "2020-01-09 03:38:03.895999908+02:00 2020-01-09 03:38:03.895999908+02:00 \n", "2020-01-09 04:08:03.895999908+02:00 2020-01-09 04:08:03.895999908+02:00 \n", "2020-01-09 04:38:03.895999908+02:00 2020-01-09 04:38:03.895999908+02:00 \n", "2020-01-09 05:08:03.895999908+02:00 2020-01-09 05:08:03.895999908+02:00 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[data[\"user\"]==\"jd9INuQ5BBlW\"].head(7)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "dbea7c11", "metadata": {}, "source": [ "We see that the bins are indeed 50-minutes bins, however, they are adjusted to fixed, predetermined intervals, i.e. 
the bins do not start at the time of the first datapoint. Instead, `pandas` starts the binning at 00:00:00 of every day and counts 50-minute intervals from there. \n", "\n", "If we want the binning to start from the first datapoint in our dataset, we need to pass each user's first timestamp as the `origin` resampling argument, looping over the users." ] }, { "cell_type": "code", "execution_count": 13, "id": "d7ff80f4", "metadata": {}, "outputs": [], "source": [ "users = list(data['user'].unique())\n", "results = []\n", "for user in users:\n", "    # Use each user's first timestamp as the resampling origin\n", "    start_time = data[data[\"user\"]==user].index.min()\n", "    results.append(au.audio_count_loud(\n", "        data[data[\"user\"]==user],\n", "        audio_column_name=\"decibels\",\n", "        resample_args={\"rule\":\"50T\", \"origin\":start_time}\n", "    ))\n", "my_loud_times = pd.concat(results)" ] }, { "cell_type": "code", "execution_count": 14, "id": "427ab240", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user audio_count_loud device\n", "2019-08-13 07:30:00+03:00 iGyXetHE3S8u 1 Cq9vueHh3zVs\n", "2019-08-13 08:20:00+03:00 iGyXetHE3S8u 1 Cq9vueHh3zVs\n", "2019-08-13 09:10:00+03:00 iGyXetHE3S8u 1 Cq9vueHh3zVs\n", "2019-08-13 10:00:00+03:00 iGyXetHE3S8u 1 Cq9vueHh3zVs\n", "2019-08-13 10:50:00+03:00 iGyXetHE3S8u 2 Cq9vueHh3zVs\n", "2019-08-13 11:40:00+03:00 iGyXetHE3S8u 1 Cq9vueHh3zVs\n", "2019-08-13 12:30:00+03:00 iGyXetHE3S8u 0 Cq9vueHh3zVs\n", "2019-08-13 13:20:00+03:00 iGyXetHE3S8u 0 Cq9vueHh3zVs\n", "2019-08-13 14:10:00+03:00 iGyXetHE3S8u 1 Cq9vueHh3zVs\n", "2019-08-13 15:00:00+03:00 iGyXetHE3S8u 0 Cq9vueHh3zVs\n", "2019-08-13 15:50:00+03:00 iGyXetHE3S8u 1 Cq9vueHh3zVs\n", "2019-08-13 16:40:00+03:00 iGyXetHE3S8u 1 Cq9vueHh3zVs\n", "2020-01-09 01:40:00+02:00 jd9INuQ5BBlW 1 3p83yASkOb_B\n", "2020-01-09 02:30:00+02:00 jd9INuQ5BBlW 2 3p83yASkOb_B\n", "2020-01-09 03:20:00+02:00 jd9INuQ5BBlW 2 3p83yASkOb_B\n", "2020-01-09 04:10:00+02:00 jd9INuQ5BBlW 0 3p83yASkOb_B\n", "2020-01-09 05:00:00+02:00 jd9INuQ5BBlW 1 3p83yASkOb_B\n", "2020-01-09 05:50:00+02:00 jd9INuQ5BBlW 1 3p83yASkOb_B\n", "2020-01-09 06:40:00+02:00 jd9INuQ5BBlW 1 OWd1Uau8POix\n", "2020-01-09 07:30:00+02:00 jd9INuQ5BBlW 0 OWd1Uau8POix\n", "2020-01-09 08:20:00+02:00 jd9INuQ5BBlW 1 OWd1Uau8POix\n", "2020-01-09 09:10:00+02:00 jd9INuQ5BBlW 1 OWd1Uau8POix\n", "2020-01-09 10:00:00+02:00 jd9INuQ5BBlW 2 OWd1Uau8POix" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_loud_times" ] }, { "attachments": {}, "cell_type": "markdown", "id": "41b3cbd2", "metadata": {}, "source": [ "### 4.2 Extract features using the wrapper\n", "We can use `niimpy`'s ready-made wrapper to extract one or several features at the same time. The wrapper will require two inputs:\n", "- (mandatory) dataframe that must comply with the minimum requirements (see '* TIP! 
Data requirements section above)\n", "- (optional) an argument dictionary for the wrapper\n", "\n", "#### 4.2.1 The argument dictionary for the wrapper (or how we specify the way the wrapper works)\n", "The argument dictionary contains the arguments for each stand-alone function we would like to employ. Its keys are the feature functions we want to compute; its values are the argument dictionaries of those functions. \n", "Let's see some examples of wrapper dictionaries:" ] }, { "cell_type": "code", "execution_count": 15, "id": "87d9d44d", "metadata": {}, "outputs": [], "source": [ "wrapper_features1 = {au.audio_count_loud:{\"audio_column_name\":\"decibels\",\"resample_args\":{\"rule\":\"1D\"}},\n", " au.audio_max_freq:{\"audio_column_name\":\"frequency\",\"resample_args\":{\"rule\":\"1D\"}}}" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7a67b446", "metadata": {}, "source": [ "- `wrapper_features1` will be used to analyze two features, `audio_count_loud` and `audio_max_freq`. For the feature audio_count_loud, we will use the data stored in the column `decibels` in our dataframe and the data will be binned in one-day periods. For the feature audio_max_freq, we will use the data stored in the column `frequency` in our dataframe and the data will be binned in one-day periods. " ] }, { "cell_type": "code", "execution_count": 16, "id": "d3332573", "metadata": {}, "outputs": [], "source": [ "wrapper_features2 = {au.audio_mean_db:{\"audio_column_name\":\"random_name\",\"resample_args\":{\"rule\":\"1D\"}},\n", " au.audio_count_speech:{\"audio_column_name\":\"decibels\", \"audio_freq_name\":\"frequency\", \"resample_args\":{\"rule\":\"5H\",\"offset\":\"5min\"}}}" ] }, { "attachments": {}, "cell_type": "markdown", "id": "205c28ba", "metadata": {}, "source": [ "- `wrapper_features2` will be used to analyze two features, `audio_mean_db` and `audio_count_speech`. 
For the feature audio_mean_db, we will use the data stored in the column `random_name` in our dataframe and the data will be binned in one-day periods. For the feature audio_count_speech, we will use the data stored in the column `decibels` in our dataframe and the data will be binned in 5-hour periods with a 5-minute offset. Note that this feature also requires a frequency column, passed via the `audio_freq_name` argument; this is because speech is defined not only by the amplitude of the recording, but also by its frequency range. " ] }, { "cell_type": "code", "execution_count": 17, "id": "a2570c5b", "metadata": {}, "outputs": [], "source": [ "wrapper_features3 = {au.audio_mean_db:{\"audio_column_name\":\"one_name\",\"resample_args\":{\"rule\":\"1D\",\"offset\":\"5min\"}},\n", " au.audio_min_freq:{\"audio_column_name\":\"one_name\",\"resample_args\":{\"rule\":\"5H\"}},\n", " au.audio_count_silent:{\"audio_column_name\":\"another_name\",\"resample_args\":{\"rule\":\"30T\",\"origin\":\"end_day\"}}}" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1377bc9d", "metadata": {}, "source": [ "- `wrapper_features3` will be used to analyze three features, `audio_mean_db`, `audio_min_freq`, and `audio_count_silent`. For the feature audio_mean_db, we will use the data stored in the column `one_name` and the data will be binned in one-day periods with a 5-minute offset. For the feature audio_min_freq, we will use the data stored in the column `one_name` in our dataframe and the data will be binned in 5-hour periods. Finally, for the feature audio_count_silent, we will use the data stored in the column `another_name` in our dataframe and the data will be binned in 30-minute periods, with the origin of the bins set to the ceiling midnight of the last day.\n", "\n", "**Default values:** if no arguments are passed, `niimpy`'s default values are either \"decibels\", \"frequency\", or \"is_silent\" for the audio_column_name, and 30-min aggregation bins. 
The column name depends on the function to be called. Moreover, in the absence of the argument dictionary, the wrapper will compute all the available functions. \n", "\n", "#### 4.2.2 Using the wrapper\n", "Now that we understand how the wrapper is customized, it is time to compute our first audio feature using the wrapper. Suppose that we are interested in extracting audio_count_loud in 50-minute bins. We will need `niimpy`'s `extract_features_audio` function, the data, and a dictionary to customize the computation. Let's create the dictionary first" ] }, { "cell_type": "code", "execution_count": 18, "id": "1a16011f", "metadata": {}, "outputs": [], "source": [ "wrapper_features1 = {au.audio_count_loud:{\"audio_column_name\":\"decibels\",\"resample_args\":{\"rule\":\"50T\"}}}" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d8ac128e", "metadata": {}, "source": [ "Now, let's use the wrapper" ] }, { "cell_type": "code", "execution_count": 19, "id": "24f453c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user device audio_count_loud\n", "2019-08-13 07:30:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1\n", "2019-08-13 08:20:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1\n", "2019-08-13 09:10:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1\n", "2019-08-13 10:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1\n", "2019-08-13 10:50:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_wrapper = au.extract_features_audio(data, features=wrapper_features1)\n", "results_wrapper.head(5)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3816fc21", "metadata": {}, "source": [ "Our first attempt was successful. Now, let's try something more. Let's assume we want to compute the audio_count_loud and audio_min_freq in 1-hour bins." ] }, { "cell_type": "code", "execution_count": 20, "id": "0906693e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user device audio_count_loud \\\n", "2019-08-13 07:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1 \n", "2019-08-13 08:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1 \n", "2019-08-13 09:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1 \n", "2019-08-13 10:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 \n", "2019-08-13 11:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 \n", "\n", " audio_min_freq \n", "2019-08-13 07:00:00+03:00 7735.0 \n", "2019-08-13 08:00:00+03:00 7690.0 \n", "2019-08-13 09:00:00+03:00 756.0 \n", "2019-08-13 10:00:00+03:00 3059.0 \n", "2019-08-13 11:00:00+03:00 12278.0 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wrapper_features2 = {au.audio_count_loud:{\"audio_column_name\":\"decibels\",\"resample_args\":{\"rule\":\"1H\"}},\n", " au.audio_min_freq:{\"audio_column_name\":\"frequency\", \"resample_args\":{\"rule\":\"1H\"}}}\n", "results_wrapper = au.extract_features_audio(data, features=wrapper_features2)\n", "results_wrapper.head(5)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2244a071", "metadata": {}, "source": [ "Great! Another successful attempt. We see from the results that more columns were added with the required calculations. This is how the wrapper works when all features are computed with the same bins. Now, let's see how the wrapper performs when each function has different binning requirements. Let's assume we need to compute the audio_count_loud every day, and the audio_min_freq every 5 hours with an offset of 5 minutes." ] }, { "cell_type": "code", "execution_count": 21, "id": "4e80bfd0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user device audio_count_loud \\\n", "2019-08-13 00:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 10.0 \n", "2020-01-09 00:00:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 7.0 \n", "2020-01-09 00:00:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 5.0 \n", "2019-08-13 05:05:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs NaN \n", "2019-08-13 10:05:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs NaN \n", "\n", " audio_min_freq \n", "2019-08-13 00:00:00+03:00 NaN \n", "2020-01-09 00:00:00+02:00 NaN \n", "2020-01-09 00:00:00+02:00 NaN \n", "2019-08-13 05:05:00+03:00 756.0 \n", "2019-08-13 10:05:00+03:00 2914.0 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wrapper_features3 = {au.audio_count_loud:{\"audio_column_name\":\"decibels\",\"resample_args\":{\"rule\":\"1D\"}},\n", " au.audio_min_freq:{\"audio_column_name\":\"frequency\", \"resample_args\":{\"rule\":\"5H\", \"offset\":\"5min\"}}}\n", "results_wrapper = au.extract_features_audio(data, features=wrapper_features3)\n", "results_wrapper.head(5)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c6563910", "metadata": {}, "source": [ "The output is once again a dataframe. In this case, two aggregations are shown. The first one is the daily aggregation computed for the `audio_count_loud` feature. The second one is the 5-hour aggregation period with a 5-minute offset for `audio_min_freq`. We must note that because the `audio_min_freq` feature is not required to be aggregated daily, it shows NaN at the daily aggregation timestamps. Similarly, because `audio_count_loud` is not required to be aggregated in 5-hour windows, it shows NaN at the 5-hour timestamps. " ] }, { "attachments": {}, "cell_type": "markdown", "id": "8a960ee8", "metadata": {}, "source": [ "#### 4.2.3 Wrapper and its default option\n", "The default option will compute all features in 30-minute aggregation windows. To use the `extract_features_audio` function with its default options, simply call the function. 
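As a rough, self-contained sketch of what a 30-minute aggregation window does, here is plain `pandas` resampling on synthetic decibel readings (this is illustration only, not `niimpy`'s implementation; the 70 dB threshold follows the `audio_count_loud` definition above):

```python
import pandas as pd

# Hypothetical decibel readings every 20 minutes (synthetic data, illustration only)
idx = pd.date_range('2020-01-09 00:00', periods=6, freq='20min')
df = pd.DataFrame({'decibels': [40, 75, 80, 30, 72, 65]}, index=idx)

# Count readings above 70 dB in each 30-minute bin,
# mirroring the loudness threshold described for audio_count_loud
loud = (df['decibels'] > 70).resample('30min').sum()
print(loud.tolist())  # [1, 1, 1, 0]
```

Each bin is labeled by its left edge, which is why the wrapper's 30-minute defaults produce timestamps like 07:00 and 07:30 in the output below.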
" ] }, { "cell_type": "code", "execution_count": 22, "id": "daf215ac", "metadata": {}, "outputs": [], "source": [ "default = au.extract_features_audio(data, features=None)" ] }, { "cell_type": "code", "execution_count": 23, "id": "68a22b4e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user device audio_count_silent \\\n", "2019-08-13 07:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 \n", "2019-08-13 07:30:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 \n", "2019-08-13 08:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 \n", "2019-08-13 08:30:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 \n", "2019-08-13 09:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1 \n", "\n", " audio_count_speech audio_count_loud \\\n", "2019-08-13 07:00:00+03:00 NaN NaN \n", "2019-08-13 07:30:00+03:00 NaN 1.0 \n", "2019-08-13 08:00:00+03:00 NaN 1.0 \n", "2019-08-13 08:30:00+03:00 NaN 0.0 \n", "2019-08-13 09:00:00+03:00 NaN 0.0 \n", "\n", " audio_min_freq audio_max_freq audio_mean_freq \\\n", "2019-08-13 07:00:00+03:00 7735.0 7735.0 7735.0 \n", "2019-08-13 07:30:00+03:00 13609.0 13609.0 13609.0 \n", "2019-08-13 08:00:00+03:00 7690.0 7690.0 7690.0 \n", "2019-08-13 08:30:00+03:00 8347.0 8347.0 8347.0 \n", "2019-08-13 09:00:00+03:00 13592.0 13592.0 13592.0 \n", "\n", " audio_median_freq audio_std_freq audio_min_db \\\n", "2019-08-13 07:00:00+03:00 7735.0 NaN 51.0 \n", "2019-08-13 07:30:00+03:00 13609.0 NaN 90.0 \n", "2019-08-13 08:00:00+03:00 7690.0 NaN 81.0 \n", "2019-08-13 08:30:00+03:00 8347.0 NaN 58.0 \n", "2019-08-13 09:00:00+03:00 13592.0 NaN 36.0 \n", "\n", " audio_max_db audio_mean_db audio_median_db \\\n", "2019-08-13 07:00:00+03:00 51.0 51.0 51.0 \n", "2019-08-13 07:30:00+03:00 90.0 90.0 90.0 \n", "2019-08-13 08:00:00+03:00 81.0 81.0 81.0 \n", "2019-08-13 08:30:00+03:00 58.0 58.0 58.0 \n", "2019-08-13 09:00:00+03:00 36.0 36.0 36.0 \n", "\n", " audio_std_db \n", "2019-08-13 07:00:00+03:00 NaN \n", "2019-08-13 07:30:00+03:00 NaN \n", "2019-08-13 08:00:00+03:00 NaN \n", "2019-08-13 08:30:00+03:00 NaN \n", "2019-08-13 09:00:00+03:00 NaN " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "default.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d0a40289", "metadata": {}, "source": [ "## 5. 
Implementing your own features" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e2dfbcbe", "metadata": {}, "source": [ "If none of the provided functions suits our needs, we can easily implement our own customized features. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should have a datetime index and a `user` column (the computation groups the data by user).\n", "Let's assume we need a new function that sums all frequencies. Let's first define the function" ] }, { "cell_type": "code", "execution_count": 24, "id": "839a0dee", "metadata": {}, "outputs": [], "source": [ "def audio_sum_freq(df, audio_column_name=\"frequency\", resample_args={\"rule\":\"30T\"}):\n", "    if len(df)>0:\n", "        result = df.groupby('user')[audio_column_name].resample(**resample_args).sum()\n", "        result = result.to_frame(name='audio_sum_freq')\n", "        result = result.reset_index(\"user\")\n", "        result.index.rename(\"datetime\", inplace=True)\n", "        return result\n", "    return None" ] }, { "attachments": {}, "cell_type": "markdown", "id": "07787017", "metadata": {}, "source": [ "Then, we can call our new function in the stand-alone way or pass it to the `extract_features_audio` wrapper. Let's use the data again and assume we want the default behavior of the wrapper. " ] }, { "cell_type": "code", "execution_count": 25, "id": "150945da", "metadata": {}, "outputs": [], "source": [ "customized_features = au.extract_features_audio(data, features={audio_sum_freq: {}})" ] }, { "cell_type": "code", "execution_count": 26, "id": "4d4bd7e4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user audio_sum_freq\n", "datetime \n", "2019-08-13 07:00:00+03:00 iGyXetHE3S8u 7735\n", "2019-08-13 07:30:00+03:00 iGyXetHE3S8u 13609\n", "2019-08-13 08:00:00+03:00 iGyXetHE3S8u 7690\n", "2019-08-13 08:30:00+03:00 iGyXetHE3S8u 8347\n", "2019-08-13 09:00:00+03:00 iGyXetHE3S8u 13592" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customized_features.head()" ] } ], "metadata": { "kernelspec": { "display_name": "niimpy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.6" } }, "nbformat": 4, "nbformat_minor": 5 }