a2-questions.md
Vanshika Mohan Bongade authored
Assignment 2 Questions
Directions
Please answer the following questions and submit in your repo for the second assignment. Please keep the answers as short and concise as possible.
-
In this assignment I asked you to provide an implementation for the `get_student(...)` function because I think it improves the overall design of the database application. After you implemented your solution, do you agree that externalizing `get_student(...)` into its own function is a good design strategy? Briefly describe why or why not.

ANSWER: Yes. Moving `get_student(...)` into its own function is a sound design decision. It makes the code more modular and reusable: the lookup logic lives in one place instead of being repeated in every part of the database program that needs a student record, such as printing, deleting, or updating. Every caller therefore retrieves student data the same way, which makes the code easier to maintain and debug.
-
Another interesting aspect of the `get_student(...)` function is how its function prototype requires the caller to provide the storage for the `student_t` structure:

```c
int get_student(int fd, int id, student_t *s);
```

Notice that the last parameter is a pointer to storage provided by the caller to be used by this function to populate information about the desired student that is queried from the database file. This is a common convention in the C programming language (commonly called pass-by-reference, although C itself passes the pointer by value).

In other programming languages an approach like the one shown below would be more idiomatic for creating a function like `get_student()` (specifically the storage is provided by the `get_student(...)` function itself):

```c
//Lookup student from the database
// IF FOUND: return pointer to student data
// IF NOT FOUND: return NULL
student_t *get_student(int fd, int id){
    student_t student;
    bool student_found = false;

    //code that looks for the student and if
    //found populates the student structure
    //The student_found variable will be set
    //to true if the student is in the database
    //or false otherwise.

    if (student_found)
        return &student;
    else
        return NULL;
}
```
Can you think of any reason why the above implementation would be a very bad idea using the C programming language? Specifically, address why the above code introduces a subtle bug that could be hard to identify at runtime?
ANSWER: The function returns a pointer to a local variable (student), which is stored on the stack. Once the function completes, this memory is no longer valid, which can lead to unpredictable behavior or even program crashes. This issue, known as a dangling pointer, can be particularly difficult to debug because the memory may seem to work temporarily but could be overwritten by other operations in the program.
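For contrast with the dangling-pointer version, here is a minimal sketch of the caller-provided-storage convention from the assignment prototype. The field layout of `student_t`, the return codes (0 for found, -1 otherwise), and the rule that a record lives at offset `id * sizeof(student_t)` are assumptions made for illustration; the real definitions come from the assignment's headers.

```c
#define _POSIX_C_SOURCE 200809L  /* for pread() */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical 64-byte record layout; the real student_t is defined by the assignment. */
typedef struct {
    int32_t id;
    char    fname[24];
    char    lname[32];
    int32_t gpa;
} student_t;

/* Assumed convention: return 0 when found, -1 when missing or on I/O error.
 * The caller owns *s, so nothing here outlives its storage. */
int get_student(int fd, int id, student_t *s)
{
    if (id <= 0 || s == NULL)
        return -1;

    /* Assumed layout: record for `id` starts at id * record size. */
    off_t offset = (off_t)id * (off_t)sizeof(student_t);
    ssize_t n = pread(fd, s, sizeof(*s), offset);
    if (n != (ssize_t)sizeof(*s))
        return -1;              /* past end of file, short read, or error */

    /* An empty slot or a hole reads back as zeroes, so the id will not match. */
    return (s->id == id) ? 0 : -1;
}
```

A caller would typically declare `student_t s;` on its own stack and pass `&s`; the record stays valid for as long as the caller's variable does.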
-
Another way the `get_student(...)` function could be implemented is as follows:

```c
//Lookup student from the database
// IF FOUND: return pointer to student data
// IF NOT FOUND or memory allocation error: return NULL
student_t *get_student(int fd, int id){
    student_t *pstudent;
    bool student_found = false;

    pstudent = malloc(sizeof(student_t));
    if (pstudent == NULL)
        return NULL;

    //code that looks for the student and if
    //found populates the student structure
    //The student_found variable will be set
    //to true if the student is in the database
    //or false otherwise.

    if (student_found){
        return pstudent;
    }
    else {
        free(pstudent);
        return NULL;
    }
}
```
In this implementation the storage for the student record is allocated on the heap using `malloc()` and passed back to the caller when the function returns. What do you think about this alternative implementation of `get_student(...)`? Address in your answer why it would work, but also think about any potential problems it could cause.

ANSWER: This implementation works because heap memory allocated with `malloc()` stays valid after the function returns, so it avoids the dangling pointer. The cost is the risk of memory leaks: freeing the record becomes the caller's responsibility. If the caller never frees it, the leaked records accumulate over time, leading to unnecessary memory consumption and potential slowdowns. A more robust approach is to pass in a pointer to a `student_t` that the caller already owns (as used in the assignment), which keeps allocation and cleanup in one place and reduces the risk of leaks.
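The ownership contract of the heap-allocating variant can be sketched as follows. The names `get_student_alloc` and `lookup_gpa`, the field layout of `student_t`, and the `id * sizeof(student_t)` file offset are hypothetical and chosen only for illustration; the point is that every successful lookup must be paired with exactly one `free()` in the caller.

```c
#define _POSIX_C_SOURCE 200809L  /* for pread() */
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical 64-byte record layout; the real student_t is defined by the assignment. */
typedef struct {
    int32_t id;
    char    fname[24];
    char    lname[32];
    int32_t gpa;
} student_t;

/* Heap-allocating lookup: returns a malloc'd record, or NULL if not found. */
student_t *get_student_alloc(int fd, int id)
{
    student_t *p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;

    off_t offset = (off_t)id * (off_t)sizeof *p;  /* assumed record placement */
    if (id <= 0 ||
        pread(fd, p, sizeof *p, offset) != (ssize_t)sizeof *p ||
        p->id != id) {
        free(p);            /* not found: release before returning */
        return NULL;
    }
    return p;               /* ownership passes to the caller */
}

/* Caller side: copy out what is needed, then free the record. */
int lookup_gpa(int fd, int id)
{
    student_t *p = get_student_alloc(fd, id);
    if (p == NULL)
        return -1;
    int gpa = p->gpa;
    free(p);                /* omitting this line leaks 64 bytes per call */
    return gpa;
}
```

Nothing in the function signature forces the `free()` call, which is why this design is easier to misuse than the caller-provided-storage prototype.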
-
Let's take a look at how storage is managed for our simple database. Recall that all student records are stored on disk using the layout of the `student_t` structure (which has a size of 64 bytes). Let's start with a fresh database by deleting the `student.db` file using the command `rm ./student.db`. Now that we have an empty database, let's add a few students and see what is happening under the covers. Consider the following sequence of commands:

```
> ./sdbsc -a 1 john doe 345
> ls -l ./student.db
-rw-r----- 1 bsm23 bsm23 128 Jan 17 10:01 ./student.db
> du -h ./student.db
4.0K ./student.db
> ./sdbsc -a 3 jane doe 390
> ls -l ./student.db
-rw-r----- 1 bsm23 bsm23 256 Jan 17 10:02 ./student.db
> du -h ./student.db
4.0K ./student.db
> ./sdbsc -a 63 jim doe 285
> du -h ./student.db
4.0K ./student.db
> ./sdbsc -a 64 janet doe 310
> du -h ./student.db
8.0K ./student.db
> ls -l ./student.db
-rw-r----- 1 bsm23 bsm23 4160 Jan 17 10:03 ./student.db
```
For this question I am asking you to perform some online research to investigate why there is a difference between the size of the file reported by the `ls` command and the actual storage used on the disk reported by the `du` command. Understanding why this happens by design is important since all good systems programmers need to understand things like how Linux creates sparse files, and how Linux physically stores data on disk using fixed block sizes. Some good Google searches to get you started: "lseek syscall holes and sparse files", and "linux file system blocks". After you do some research please answer the following:
-
Please explain why the file size reported by the `ls` command was 128 bytes after adding student with ID=1, 256 after adding student with ID=3, and 4160 after adding the student with ID=64?

ANSWER: The `ls` command reports the logical file size, which is determined by the offset just past the last byte written to the file. Each record is stored at offset ID * 64 (the record size), so writing student ID n extends the file to (n + 1) * 64 bytes: 2 * 64 = 128 for ID 1, 4 * 64 = 256 for ID 3, and 65 * 64 = 4160 for ID 64.
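The size arithmetic can be checked with a small helper. This is a sketch that assumes each record is written at offset `id * 64` and that the file never shrinks; `size_after_add` is a hypothetical name, not part of `sdbsc`.

```c
#define RECORD_SIZE 64LL

/* Logical file size after writing the record for `id`:
 * the record ends at (id + 1) * RECORD_SIZE, and the file only
 * grows if that end lies beyond the current size. */
long long size_after_add(long long current_size, int id)
{
    long long record_end = ((long long)id + 1) * RECORD_SIZE;
    return record_end > current_size ? record_end : current_size;
}
```

Starting from an empty file, adding IDs 1, 3, 63, and 64 in turn gives 128, 256, 4096, and 4160 bytes, matching the `ls -l` output in the transcript wherever a size is shown.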
-
Why did the total storage used on the disk remain unchanged when we added the student with ID=1, ID=3, and ID=63, but increased from 4K to 8K when we added the student with ID=64?
ANSWER: The `du` command reports the disk blocks actually allocated to the file. The filesystem allocates storage in fixed-size blocks (4K here), and because Linux supports sparse files, a block is only allocated once data is written into it. The records for IDs 1, 3, and 63 all fall within the first 4K block: ID 63 occupies bytes 4032 through 4095. The record for ID 64 starts at offset 4096, the first byte of the second block, so a second 4K block is allocated and disk usage grows from 4K to 8K.
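The block accounting can be sketched as a small calculation. It assumes a 4096-byte filesystem block, 64-byte records at offset `id * 64`, and that only blocks containing a written record are allocated (filesystem metadata is ignored); `block_of` and `count_blocks` are hypothetical helper names.

```c
#define RECORD_SIZE 64L
#define BLOCK_SIZE  4096L

/* Index of the block that holds the record for `id`.
 * BLOCK_SIZE is a multiple of RECORD_SIZE, so a record never straddles two blocks. */
long block_of(int id)
{
    return ((long)id * RECORD_SIZE) / BLOCK_SIZE;
}

/* Number of distinct blocks touched by the first n ids in the list. */
int count_blocks(const int *ids, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++) {
            if (block_of(ids[j]) == block_of(ids[i])) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

For the IDs 1, 3, and 63 this yields one block (4K); adding ID 64 yields two blocks (8K), in line with the `du -h` output.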
-
Now let's add one more student with a large student ID number and see what happens:

```
> ./sdbsc -a 99999 big dude 205
> ls -l ./student.db
-rw-r----- 1 bsm23 bsm23 6400000 Jan 17 10:28 ./student.db
> du -h ./student.db
12K ./student.db
```
We see from above that adding a student with a very large student ID (ID=99999) increased the file size to 6400000 as shown by `ls`, but the raw storage only increased to 12K as reported by `du`. Can you provide some insight into why this happened?

ANSWER: The record for student ID 99999 is written at offset 99999 * 64 = 6399936, so the logical file size jumps to 100000 * 64 = 6400000 bytes. However, Linux creates a sparse file: the unwritten range between the earlier records and the new one is a hole that reads back as zeroes and has no disk blocks allocated. Only the one additional 4K block that holds the new record is stored, so disk usage grows from 8K to 12K.
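The same effect can be reproduced outside of `sdbsc` with `lseek()` and `write()`. This is a minimal sketch, not the assignment's implementation: `write_record_at` is a hypothetical helper that seeks past the end of a new file, writes one 64-byte dummy record, and reports the resulting `stat` information.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) `path`, write one 64-byte record at `offset`,
 * and fill *st with the file's metadata. Returns 0 on success, -1 on error. */
int write_record_at(const char *path, off_t offset, struct stat *st)
{
    char record[64];
    memset(record, 'x', sizeof record);   /* dummy payload */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0640);
    if (fd < 0)
        return -1;

    int rc = -1;
    /* Seeking past end-of-file and then writing leaves a hole before the data. */
    if (lseek(fd, offset, SEEK_SET) == offset &&
        write(fd, record, sizeof record) == (ssize_t)sizeof record &&
        fstat(fd, st) == 0)
        rc = 0;

    close(fd);
    return rc;
}
```

Calling it with offset 6399936 produces a file whose `st_size` is 6400000. On Linux, `st_blocks` counts 512-byte units, so on a filesystem that supports sparse files `st_blocks * 512` should be far smaller than `st_size`; the exact value depends on the filesystem and its block size.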
-