diff --git a/A2/a2-questions.md b/A2/a2-questions.md new file mode 100644 index 0000000000000000000000000000000000000000..ae683a1693c8dd15020917f1c3d9b585ed1f8cff --- /dev/null +++ b/A2/a2-questions.md @@ -0,0 +1,137 @@ +## Assignment 2 Questions + +#### Directions +Please answer the following questions and submit in your repo for the second assignment. Please keep the answers as short and concise as possible. + +1. In this assignment I asked you to provide an implementation for the `get_student(...)` function because I think it improves the overall design of the database application. After you implemented your solution do you agree that externalizing `get_student(...)` into its own function is a good design strategy? Briefly describe why or why not. + + > **Answer**: Yes, externalizing the `get_student(...)` function is a good design strategy because it encapsulates the logic for locating and validating a student record in one place. This improves modularity, reduces code duplication, and makes the application easier to maintain and test. + + +2. Another interesting aspect of the `get_student(...)` function is how its function prototype requires the caller to provide the storage for the `student_t` structure: + + ```c + int get_student(int fd, int id, student_t *s); + ``` + + Notice that the last parameter is a pointer to storage **provided by the caller** to be used by this function to populate information about the desired student that is queried from the database file. This is a common convention (called pass-by-reference) in the `C` programming language. 
+ + In other programming languages an approach like the one shown below would be more idiomatic for creating a function like `get_student()` (specifically the storage is provided by the `get_student(...)` function itself): + + ```c + //Lookup student from the database + // IF FOUND: return pointer to student data + // IF NOT FOUND: return NULL + student_t *get_student(int fd, int id){ + student_t student; + bool student_found = false; + + //code that looks for the student and if + //found populates the student structure + //The student_found variable will be set + //to true if the student is in the database + //or false otherwise. + + if (student_found) + return &student; + else + return NULL; + } + ``` + Can you think of any reason why the above implementation would be a **very bad idea** using the C programming language? Specifically, address why the above code introduces a subtle bug that could be hard to identify at runtime? + + > **ANSWER:** The main problem is that the function returns a pointer to a local variable (`student`) allocated on the stack. Once the function returns, that memory is reclaimed, so the pointer becomes a dangling pointer. Accessing it later results in undefined behavior, which can lead to intermittent and hard-to-debug runtime errors. + + +3. Another way the `get_student(...)` function could be implemented is as follows: + + ```c + //Lookup student from the database + // IF FOUND: return pointer to student data + // IF NOT FOUND or memory allocation error: return NULL + student_t *get_student(int fd, int id){ + student_t *pstudent; + bool student_found = false; + + pstudent = malloc(sizeof(student_t)); + if (pstudent == NULL) + return NULL; + + //code that looks for the student and if + //found populates the student structure + //The student_found variable will be set + //to true if the student is in the database + //or false otherwise. 
+ + if (student_found){ + return pstudent; + } + else { + free(pstudent); + return NULL; + } + } + ``` + In this implementation the storage for the student record is allocated on the heap using `malloc()` and passed back to the caller when the function returns. What do you think about this alternative implementation of `get_student(...)`? Address in your answer why it would work, but also think about any potential problems it could cause. + + > **ANSWER:** This alternative implementation works because it allocates the student record on the heap, ensuring that the returned pointer remains valid after the function returns. The caller can then use the pointer to access the data without worrying about the data disappearing when the function exits. +However, there are potential drawbacks: +Memory Management Responsibility: The caller now must remember to free the allocated memory. Failing to do so can lead to memory leaks, especially if `get_student(...)` is called frequently. +Performance Overhead: Dynamic allocation (using `malloc()`) incurs extra overhead compared to using pre-allocated or stack memory. In performance-critical code, this might be a concern. +Error Handling: If `malloc()` fails, the function must handle that error, and the caller must also be prepared to deal with a NULL pointer, adding complexity to the code. + + + +4. Let's take a look at how storage is managed for our simple database. Recall that all student records are stored on disk using the layout of the `student_t` structure (which has a size of 64 bytes). Let's start with a fresh database by deleting the `student.db` file using the command `rm ./student.db`. Now that we have an empty database let's add a few students and see what is happening under the covers. 
Consider the following sequence of commands: + + ```bash + > ./sdbsc -a 1 john doe 345 + > ls -l ./student.db + -rw-r----- 1 bsm23 bsm23 128 Jan 17 10:01 ./student.db + > du -h ./student.db + 4.0K ./student.db + > ./sdbsc -a 3 jane doe 390 + > ls -l ./student.db + -rw-r----- 1 bsm23 bsm23 256 Jan 17 10:02 ./student.db + > du -h ./student.db + 4.0K ./student.db + > ./sdbsc -a 63 jim doe 285 + > du -h ./student.db + 4.0K ./student.db + > ./sdbsc -a 64 janet doe 310 + > du -h ./student.db + 8.0K ./student.db + > ls -l ./student.db + -rw-r----- 1 bsm23 bsm23 4160 Jan 17 10:03 ./student.db + ``` + + For this question I am asking you to perform some online research to investigate why there is a difference between the size of the file reported by the `ls` command and the actual storage used on the disk reported by the `du` command. Understanding why this happens by design is important since all good systems programmers need to understand things like how Linux creates sparse files, and how Linux physically stores data on disk using fixed block sizes. Some good Google searches to get you started: _"lseek syscall holes and sparse files"_, and _"linux file system blocks"_. After you do some research please answer the following: + + - Please explain why the file size reported by the `ls` command was 128 bytes after adding student with ID=1, 256 after adding student with ID=3, and 4160 after adding the student with ID=64? + + > **ANSWER:** Student with ID=1: + The student record is stored at offset 1 × 64 = 64 bytes and occupies bytes 64–127. Thus, the logical file size becomes 128 bytes. +Student with ID=3: +The record is stored at offset 3 × 64 = 192 bytes (occupying 192–255). The file now logically extends to 256 bytes. +Student with ID=64: +The record is stored at offset 64 × 64 = 4096 bytes (occupying 4096–4159). The file size is now 4160 bytes. 
+The `ls` command shows the “logical” file size, which is determined by the highest written byte (plus one). + + + - Why did the total storage used on the disk remain unchanged when we added the student with ID=1, ID=3, and ID=63, but increased from 4K to 8K when we added the student with ID=64? + + > **ANSWER:** Linux filesystems use fixed-size blocks (commonly 4K) and support sparse files. When you write records at IDs 1, 3, and 63, the file’s logical size grows, but only the blocks that actually contain data are physically allocated. All three records fall within the first 4K block (bytes 0–4095), so disk usage stays at 4K, and the unwritten “holes” between these records do not consume additional disk space. When student ID=64 is added, its record falls into a new 4K block (starting at byte 4096), causing an additional block to be allocated. This increases the actual disk usage from 4K to 8K. + + + - Now let's add one more student with a large student ID number and see what happens: + + ```bash + > ./sdbsc -a 99999 big dude 205 + > ls -l ./student.db + -rw-r----- 1 bsm23 bsm23 6400000 Jan 17 10:28 ./student.db + > du -h ./student.db + 12K ./student.db + ``` + We see from above adding a student with a very large student ID (ID=99999) increased the file size to 6400000 as shown by `ls` but the raw storage only increased to 12K as reported by `du`. Can you provide some insight into why this happened? + + > **ANSWER:** Student with ID=99999 is stored at an offset of 99999 × 64 = 6,399,936 bytes, and with its 64-byte record the file’s logical size becomes exactly 6,400,000 bytes. However, almost the entire file is “hole” (regions that were never written to). The filesystem does not allocate disk blocks for these holes, so despite the large logical size, only the blocks that actually contain data are physically allocated: one for IDs 1, 3, and 63, one for ID 64, and one for ID 99999. Hence, `du` shows only 12K of disk usage.
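The offset and block arithmetic used in the answers to question 4 can be checked with a small sketch. It is only a model, not the `sdbsc` implementation: it assumes each record lives at offset `id * 64` (as the answers above state), that the filesystem uses 4096-byte blocks, and that only blocks containing written bytes are allocated. The names `record_offset`, `logical_size`, and `blocks_used` are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RECORD_SIZE 64L    /* size of student_t, per the assignment text */
#define BLOCK_SIZE  4096L  /* ASSUMED filesystem block size */

/* Byte offset at which the record for a given student ID starts. */
long record_offset(int id)
{
    assert(id >= 0);
    return (long)id * RECORD_SIZE;
}

/* Logical file size (what ls reports): one past the highest byte written. */
long logical_size(const int *ids, size_t n)
{
    long max_end = 0;
    for (size_t i = 0; i < n; i++) {
        long end = record_offset(ids[i]) + RECORD_SIZE;
        if (end > max_end)
            max_end = end;
    }
    return max_end;
}

/*
 * Number of distinct blocks that hold at least one record. Because
 * BLOCK_SIZE is a multiple of RECORD_SIZE, a record never straddles
 * two blocks, so each record maps to exactly one block index.
 * Multiplying the result by BLOCK_SIZE models what du reports.
 */
long blocks_used(const int *ids, size_t n)
{
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        long block = record_offset(ids[i]) / BLOCK_SIZE;
        int seen = 0;
        /* Skip this block if an earlier ID already landed in it. */
        for (size_t j = 0; j < i; j++) {
            if (record_offset(ids[j]) / BLOCK_SIZE == block) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

Calling `logical_size` and `blocks_used` on the IDs `{1, 3, 63, 64, 99999}` gives the figures in the transcript under these assumptions: a logical size of 6,400,000 bytes and three allocated blocks (12K). Real `du` output can differ on filesystems with other block sizes or with extra metadata blocks.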