What is the most difficult problem that you have fixed in linux?

Waffelson@lemmy.world · 8 months ago

What is the most difficult problem that you have fixed in linux?

Random Dent@lemmy.ml · 8 months ago

I have two, one is actually complicated and one was so obtuse that I never would have figured it out in a million years:

Actually complicated: I still don’t know how it happened, but somehow an update on Arch filled the boot partition with junk files, which then caused the kernel update to fail because of no disk space, which then kind of tanked the whole system. It took ages, but with a boot disk and chroot-ing back into the boot partition I eventually managed to untangle it all. I was determined to see it through and not reinstall.

Ridiculous: One day when using Ubuntu, the entire system went upside-down. As in, everything was working perfectly fine, but literally the screen was upside-down. After much Googling I had no luck figuring it out, then I accidentally found the solution - I’d plugged a PS4 controller into the USB on the laptop to charge it, and for some reason Ubuntu interpreted the gyroscope on the controller as “rotate the screen display” so when I moved it, the screen spun round. I only figured it out by accident when I plugged it back in and it spun back to normal lol.

0110010001100010@lemmy.world · 8 months ago

Ridiculous: One day when using Ubuntu, the entire system went upside-down. As in, everything was working perfectly fine, but literally the screen was upside-down. After much Googling I had no luck figuring it out, then I accidentally found the solution - I’d plugged a PS4 controller into the USB on the laptop to charge it, and for some reason Ubuntu interpreted the gyroscope on the controller as “rotate the screen display” so when I moved it, the screen spun round. I only figured it out by accident when I plugged it back in and it spun back to normal lol.

LMAO what the fuck?

mojo_raisin@lemmy.world · 8 months ago

This deserves some sort of funniest Linux problem award.

bruhbeans@lemmy.ml · 8 months ago

The controller thing is goddam hilarious

edric@lemm.ee · 8 months ago

Ridiculous

I had a similar one. I had a USB-powered fan cooling pad that my laptop was sitting on. My laptop would randomly go into boot loops when I turned it on. I thought it was a grub issue so I always had my USB stick ready to re-install grub. Did some dusting one day and forgot to plug in the cooling fan, then the boot loop never happened again. Turns out it was the fan plugged into the USB port that was causing it.

foggy@lemmy.world · 8 months ago

I think this is likely related to USB cables as power cables and USB ports/voltages.

I have seen a lamp completely fry a MacBook. I wouldn’t be surprised to see something similar cause a boot loop.

curiousPJ@lemmy.world · 8 months ago

Semi-related note… displayport cables can cause a no-boot condition too. I think it was the existence of Pin#1. I had to duct tape that one pin and my computer finally booted up.

evidences@lemmy.world · 8 months ago

A couple years ago on Reddit I saw a story where a dude working IT support had to drive to a remote office to replace a workstation that wouldn’t boot. When he got there the lady whose desk it was had some shitty USB fan or maybe an LED Christmas tree plugged into one of the USB ports. He unplugged that and the PC booted fine.

Hadriscus@lemm.ee · edit-2 8 months ago

This is up there with the ~~redacted~~ (just looked it up it’s called the 500-mile email)

Random Dent@lemmy.ml · 8 months ago

Ah I remember that one! Classic. I also remember a story about someone who lost an entire PC in their apartment. It was running and connected to the network, they could ping it, but couldn’t physically find it lol.

Hadriscus@lemm.ee · 8 months ago

😂 Please ping me if you find it (the story)…

Corr@lemm.ee · 8 months ago

This is a phenomenal read. Thank you for sharing lol

MentalEdge@sopuli.xyz · edit-2 8 months ago

I manage a machine that runs both media transcodes and some video game servers.

The video game servers have to run in real-time, or very close to it. Otherwise players using them suffer noticeable lag.

Achieving this while an ffmpeg process was running was completely impossible, no matter what I did to limit ffmpeg’s use of CPU time. Even when running it at lowest priority it impacted the game server processes running at top priority. Even if I limited it to one thread, it was affecting things.

I couldn’t understand the problem. There was enough CPU time to go around to do both things, and the transcode wasn’t even time sensitive, while the game server was, so why couldn’t the Linux kernel just figure it out and schedule things in a way that made sense?

So, for the first time I read up on how computers actually handle processes, multi-tasking and CPU scheduling.

As FFMPEG is an application that uses ALL available CPU time until a task is done, I came to the conclusion that due to how context switching works (CPU cores can only do one thing, they just switch out what they do really fast, but this too takes time) it was causing the system to fall behind on the video game processes when the system was operating with zero processing headroom. The scheduler wasn’t smart enough to maintain a real-time process in the face of FFMPEG, which would occupy ALL available cycles.

I learned the solution was core pinning. Manually setting processes to run on certain cores of the CPU. I set FFMPEG to use only one core, since it doesn’t matter how fast it completes. And I set the game processes to use all but that one core, so they don’t accidentally end up queueing for CPU time on a core that doesn’t have the headroom to allow the task to run within a reasonable time range.

This has completely solved the problem, as the game processes and FFMPEG no longer wait for CPU cycles in the same queue.
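The pinning described above can be sketched with taskset from util-linux. This is a minimal sketch, not the poster's actual setup: the helper name game_cores, the file names and the GAME_SERVER_PID variable are hypothetical, and it assumes core 0 is the one set aside for ffmpeg.

```shell
#!/bin/sh
# Sketch: reserve core 0 for ffmpeg and leave the rest to the game servers.
# Assumes taskset (util-linux) and nproc (coreutils) are available.

# game_cores TOTAL: print the CPU list for the game servers, i.e. every
# core except core 0. Fails if there are fewer than two cores.
game_cores() {
    total=$1
    if [ "$total" -lt 2 ]; then
        echo "need at least 2 cores" >&2
        return 1
    fi
    if [ "$total" -eq 2 ]; then
        echo "1"
    else
        echo "1-$((total - 1))"
    fi
}

# Usage (commented out; file names and the PID variable are placeholders):
#   taskset -c 0 ffmpeg -i input.mkv output.mp4
#   taskset -cp "$(game_cores "$(nproc)")" "$GAME_SERVER_PID"
```

Starting ffmpeg under taskset -c confines it from the first instruction, while taskset -cp changes the affinity of a process that is already running.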

flambonkscious@sh.itjust.works · 8 months ago

Well that’s interesting… I’d have thought, possibly naively, that as long as a thread had work to do it would essentially behave like ffmpeg does?

Perhaps there’s something about the type of work though, that it’s very CPU-bound or something?

MentalEdge@sopuli.xyz · edit-2 8 months ago

I think the difference is simply that most processes only have a certain amount that needs accomplishing in a given unit of time. As long as they can get enough CPU time, and do so soon enough after getting in line for it, they can maintain real-time execution.

Very few workloads have that much to do for that long. But I would expect other similar workloads to present the same problem.

There is a useful stat which Linux tracks in addition to a simple CPU usage percentage. The “load average” represents the average number of processes that are either running or queueing for CPU time (on Linux it also counts processes stuck in uninterruptible waits, such as disk I/O).

As long as the number is lower than the available number of cores, this essentially means that whenever one process is done running a task, the next in line can get right on with theirs.

If the load average is less than the number of cores available, that means the cores have idle time where they are essentially just waiting for a process to need them for something. Good for time-sensitive processes.

If the load average is above the number of cores, that means some processes are having to wait for several cycles of other processes having their turn, before they can execute their tasks. Interestingly, the load average can go beyond this threshold way before the CPU hits 100% usage.

I found that I can allow my system to get up to a load average of about 1.5 times the number of cores available, before you start noticing it when playing on one of the servers I run.

And whenever ffmpeg was running, the load average would spike to 10-20 times the number of cores. Not good.
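To put a number on the rule of thumb above, the load average can be divided by the core count. A minimal sketch, assuming a Linux system that exposes /proc/loadavg; the function name load_per_core is made up for illustration, and the 1.5 figure is only this commenter's observation, not a general threshold.

```shell
#!/bin/sh
# load_per_core [FILE] [CORES]
# Print the 1-minute load average divided by the number of cores.
# FILE defaults to /proc/loadavg, CORES to the output of nproc.
load_per_core() {
    file=${1:-/proc/loadavg}
    cores=${2:-$(nproc)}
    # The first field of /proc/loadavg is the 1-minute average.
    LC_ALL=C awk -v c="$cores" '{ printf "%.2f\n", $1 / c }' "$file"
}

# Usage: values above roughly 1.5 were where lag became noticeable here.
#   load_per_core
```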

flambonkscious@sh.itjust.works · 8 months ago

That makes complete sense - if you’ve got something ‘needy’, as soon as it’s queuing up, I imagine it snowballs, too…

10-20 times the core count is crazy, but I guess it’s had a lot of development effort put into parallelizing its execution, which of course goes against what your use case is :)

MentalEdge@sopuli.xyz · edit-2 8 months ago

Theoretically a load average could be as high as it likes, it’s essentially just the length of the task queue, after all.

Processes having to queue to get executed is no problem at all for lots of workloads. If you’re not running anything latency-sensitive, a huge load average isn’t a problem.

Also it’s not really a matter of parallelization. Like I mentioned, ffmpeg impacted other processes even when restricted to running in a single thread.

That’s because most other processes will do work in small chunks that complete within microseconds or a few milliseconds. Send a network request, parse some data, decode an image, poll HID device, etc.

A transcode meanwhile can easily have a CPU running full tilt for well over a second, working on just that one thing. Most processes will show up and go “I need X amount of CPU time” while ffmpeg will show up and go “give me all available CPU time” which is something the scheduler can’t actually quantify.

It’s like if someone showed up at a buffet and asked for all the food that no-one else is going to eat. How do you determine exactly how much that is, and thereby how much it is safe to give this person without giving away food someone else might’ve needed?

You don’t. Without CPU headroom it becomes very difficult for the task scheduler to maintain low system latency. It’ll do a pretty good job, but inevitably some CPU time that should have gone to other stuff will go to the process asking for as much as it can get.

Waffelson@lemmy.world · 8 months ago

This reminded me of how I disabled processor cores in Process Lasso for programs

Hyrulian@lemmy.world · 8 months ago

Around 2017 I spent three days on and off trying to diagnose why my laptop running elementary OS had no wifi support. I reinstalled the wifi drivers and everything countless times. It worked for many days initially then just didn’t one day when I got on the laptop. Turns out I had accidentally flipped the wifi toggle switch while it was in my bag. I forgot the laptop had one. Womp womp.

passepartout@feddit.de · 8 months ago

I had a friend come over to my place to fix her laptop’s wifi. After about an hour searching for any setting in Windows that I could have missed, I coincidentally found a forum where someone pointed out this could be due to a hardware wifi switch…

Hawke@lemmy.world · 8 months ago

Womp womp.

I used to bullseye womp rats in my T-16 back home, they’re not much bigger than 2 meters.

GravitySpoiled@lemmy.ml · 8 months ago

Grub.

Seriously. That was some fat as shit because I didn’t know what I was doing.

nul9o9@lemmy.world · 8 months ago

I broke my bootloader fucking with uefi settings. I was in a panic for a few hours because I hadn’t bothered to learn how that shit worked until then.

It sure was a relief when I got back into my system.

passepartout@feddit.de · 8 months ago

Bricked my PC twice because of the bootloader and couldn’t repair it. From now on I just nuke my system if something is fucky and have a shell script do the installing of packages etc.

teft@lemmy.world · 8 months ago

I once exited vim without having to look up the commands.

flambonkscious@sh.itjust.works · 8 months ago

I suppose it’s statistically inevitable, I just didn’t think it would happen in my lifetime

z00s@lemmy.world · 8 months ago

Truly you are a god amongst men

Treczoks@lemmy.world · 8 months ago

My first Linux machine crashing. This was way before Redhat, Ubuntu, Arch, or OpenSUSE. This was installed from 60+ floppy disks on a 386-40 with 8MB of RAM.

This machine ran happily, but it crashed under heavy load. I tried generating the load with different applications, but could not pin it on any particular program. So the next thing I checked was the RAM. Memtest86 ran for a day without any problems. But the crashes still came. So I got the infrared camera from the lab to see if some hardware was overheating. Nope, this went nowhere, either.

Then I tested the harddisk. A read test of the whole HD went without problems. I copied the data to a backup medium and did a write and read test by dd’ing /dev/zero over the whole disk, and then dd’ing the disk to /dev/null. Nothing showed up.

I reinstalled Linux, and it crashed again. But this time, I noticed that something was odd with the harddisk. I added a second swap partition, disabled the first, and the machine ran without problems. Strange…

So I wrote a small program that tested the part of the disk occupied by the old swap space: Write data, read data, and log everything with timestamps. And there was the culprit: There was an area on the HD where I could write any data, but when I read blocks from that area, a) the read took a very long time, b) the blocks I read contained all zeros, regardless of what I had written, and worst of all c) there was no error indication whatsoever from the controller or drive. Down at the kernel level, the zeroed blocks were happily served by the HD with an “OK”. And the faulty area was right in the middle of the original swap partition.
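A write-then-read-back test of this kind could look roughly like the following. It is a sketch under assumptions, not the original program: the function name verify_blocks is hypothetical, it overwrites whatever is in the target, and it writes a non-zero pattern so that the silently zeroed blocks described above show up as mismatches (the earlier /dev/zero test could not catch them). On a real disk the reads would also have to bypass the page cache, for example with GNU dd's iflag=direct, which is left out here so the sketch also runs on ordinary files.

```shell
#!/bin/sh
# verify_blocks TARGET COUNT [BLOCK_SIZE]
# Write a non-zero pattern to the first COUNT blocks of TARGET, read each
# block back, and log every block whose contents differ.
# WARNING: destroys the data in TARGET.
verify_blocks() {
    target=$1
    count=$2
    bs=${3:-512}
    pattern=$(mktemp) || return 1
    readback=$(mktemp) || return 1
    # One block of a non-zero pattern, so an all-zero read-back stands out.
    head -c "$bs" /dev/zero | tr '\0' 'U' > "$pattern"
    bad=0
    i=0
    while [ "$i" -lt "$count" ]; do
        dd if="$pattern" of="$target" bs="$bs" seek="$i" count=1 conv=notrunc 2>/dev/null
        start=$(date +%s)
        dd if="$target" of="$readback" bs="$bs" skip="$i" count=1 2>/dev/null
        elapsed=$(( $(date +%s) - start ))
        if ! cmp -s "$pattern" "$readback"; then
            echo "$(date +%s) block $i: mismatch after ${elapsed}s read"
            bad=$((bad + 1))
        fi
        i=$((i + 1))
    done
    rm -f "$pattern" "$readback"
    # Succeed only if every block read back intact.
    [ "$bad" -eq 0 ]
}
```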

TheRadiatorIsWarm@sopuli.xyz · edit-2 8 months ago

Nice read! Did you delete the old swap space or leave it as-is?

Treczoks@lemmy.world · 8 months ago

I took no risks and binned the disk. I wanted to buy a bigger one, anyway.

TPTheWiper@feddit.de · 8 months ago

lol nerd

Hadriscus@lemm.ee · 8 months ago

Are you saying this as a compliment? It’s not completely clear. Either way, it is a compliment

Richard@lemmy.world · 8 months ago

Blocked

TPTheWiper@feddit.de · 8 months ago

Yes a compliment

ulterno@lemmy.kde.social · 8 months ago

If you were trolling, “Blocked” would definitely be a compliment.

rowinxavier@lemmy.world · 8 months ago

Working for a VoIP company in the early 2010s I rm -rf’d the /bin/ directory. As root. On a production server. On site.

I ended up booting from my phone (android app for iso booting) then manually copied over the files from another machine. Chrooted and some stuff was broken but rebuilding from the package manager reinstalled everything that was missing. Got the system back up in around 40 mins after that colossal screw up. Good fun and a great learning experience. Honestly, my manager should not have had me doing anything on a root shell with no training.

dejected_warp_core@lemmy.world · 8 months ago

I ended up booting from my phone (android app for iso booting)

Impressive. I had no idea that was a thing. That’s easily the most “Star Trek” sounding fix I’ve heard in a good while.

back up in around 40 mins […] on a root shell with no training.

… and you intuited that fix, or at least pulled it together from scratch/google with no training? Doubly impressive.

fossphi@lemm.ee · 8 months ago

You might enjoy this ;)

https://www.ee.torontomu.ca/~elf/hack/recovery.html

ChojinDSL@discuss.tchncs.de · 8 months ago

Around 2003-2004. I was still a bit of a Linux noob, just getting to grips with Gentoo.

Had two no-name WiFi adapters that weren’t directly supported under Linux. Found some obscure forum thread that mentioned them, along with which lines in which source code driver to change to make these adapters work.

mojo_raisin@lemmy.world · 8 months ago

Wow nice one! I don’t think anyone outside of Gentoo or LFS would even go there.

Maxxus@sh.itjust.works · 8 months ago

Maybe this goes a bit deeper than the question intended, but I’ve made and shared two patches that I had to apply locally for years before they were merged into the base packages.

The first was a patch in 2015 for SDL2 to prevent the Sixaxis and other misbehaving controllers from using uninitialized axes and overwriting initialized ones. Merged in 2018.

The second was a patch in the spring of 2021 for Xft so that it no longer assumes all the glyphs in a monospaced font are the same size. Some fonts have ligatures, which are glyphs that represent multiple characters together, so they’re actually some multiple of the base glyph size. Merged in the fall of 2022.

bane_killgrind@lemmy.ml · 8 months ago

How dare you science in a kvetching discussion

BlueDwaggin@pawb.social · 8 months ago

Getting WiFi to work in 2003

Too Lazy Didn't Name@lemmy.world · 8 months ago

For me, it was getting WiFi to work in 2023

TimeSquirrel@kbin.social · 8 months ago

NDISWrapper: we’re just gonna trick the Windows driver into thinking it’s running on Windows and intercept the system calls.

That was certainly an era.

hardaysknight@lemmy.world · 8 months ago

God what a nightmare that was

lightnegative@lemmy.world · 7 months ago

Oh god I remember that. Luckily in 2003 my main computer was scraped together from discarded parts at my father’s day job, so it was ethernet only

In 2024 on a laptop I still have wifi problems though. Most recently, if I closed and opened the laptop lid (suspend + resume), the wifi hardware just disappeared off the face of the kernel.

Turns out that the iwlwifi kernel module just irreversibly crashes when the laptop suspends and can only be fixed with a reboot.

So I had the fun task of learning about systemd pre-suspend hooks to unload the driver before suspend and load it again on resume.
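One way to do this, which may not be what the poster used, is a script in /usr/lib/systemd/system-sleep/. systemd runs every executable in that directory with two arguments: pre or post, followed by the sleep type. The sketch below assumes the driver stack is iwlmvm on top of iwlwifi (check lsmod, since some hardware uses iwldvm instead); the function name and the MODPROBE override are hypothetical and exist only so the logic can be tried without root.

```shell
#!/bin/sh
# Sketch of a hook for /usr/lib/systemd/system-sleep/.
# systemd calls it with $1 = "pre" (before sleep) or "post" (after resume).
# Assumption: the wifi driver stack is iwlmvm on top of iwlwifi.
MODPROBE=${MODPROBE:-modprobe}

wifi_sleep_hook() {
    case "$1" in
        pre)  $MODPROBE -r iwlmvm iwlwifi ;;  # unload before suspending
        post) $MODPROBE iwlmvm ;;             # reload (pulls in iwlwifi)
    esac
}

# When installed as a hook, uncomment so systemd's arguments are handled:
# wifi_sleep_hook "$@"
```

The file has to be executable and owned by root for systemd to run it.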

Turns out wifi drivers still suck in 2024

Diplomjodler@feddit.de · 8 months ago

Fixed a typo in my /etc/fstab that prevented the NAS from mounting. I am a bear of little brain. But I’m also proof that you don’t have to be some master hacker to successfully run Linux.

NaoPb@eviltoast.org · 8 months ago

This is something I’ve had to do a few times.

Saved me from reinstalling. Made me realise that there really should be an alternative to typing into fstab by hand, since we humans will make mistakes. Either that or make fstab not crash completely on an error but just skip it.
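Two existing tools cover part of this wish: the nofail mount option tells the system not to treat a missing device as a boot failure, and findmnt --verify (util-linux) checks fstab for problems before a reboot. As a rough illustration of the kind of check that catches simple typos, here is a minimal sketch; the function name check_fstab is hypothetical and it only counts fields, so it is far less thorough than findmnt.

```shell
#!/bin/sh
# check_fstab [FILE]
# Flag non-comment lines that do not have between 4 and 6 fields
# (device, mount point, type, options, and the optional dump/pass fields).
# FILE defaults to /etc/fstab. Exits non-zero if any line is flagged.
check_fstab() {
    awk '
        /^[ \t]*(#|$)/ { next }   # skip comments and blank lines
        NF < 4 || NF > 6 {
            printf "line %d: expected 4-6 fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad + 0 }
    ' "${1:-/etc/fstab}"
}

# Usage:
#   check_fstab && echo "field counts look sane"
```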

johannesvanderwhales@lemmy.world · edit-2 8 months ago

Back in the day, I upgraded a Slackware install from kernel 1.3 to 2.0. That was a fucking adventure.

The fun part about back then was that if your machine wouldn’t boot or if you couldn’t get your modem or pppd working, you probably didn’t have another internet connected device so you might have to drive somewhere with a computer…or try to figure it out through books.

megabat@lemm.ee · 8 months ago

You probably remember the libc5 to glibc swap. Bad times to DIY distros.

johannesvanderwhales@lemmy.world · edit-2 8 months ago

Yep. I remember at the time I saw a lot of advice saying “you know you might want to seriously consider just installing your distro from scratch with a newer version.” Tracking down all of the dependencies (some of which had to be installed as binaries) was a very manual process.

Edit: Oh and another fun aspect of that time period was that since downloads were so slow on a modem, if you wanted a newer version or to try out another distro, you would go and order a cdrom from a place like Walnut Creek.

Swordgeek@lemmy.ca · 8 months ago

Not Linux, but Solaris, back in the day.

We had a system with a mirrored boot disk. One of the disks failed. And we were unable to boot from the other, because the boot device in OBP (~BIOS) pointed to a device-specific partition. When we manually booted from the live device, it was lacking the boot sector code, and wouldn’t boot. When we booted from CDROM, the partitions wouldn’t mount because the virtual device mapping pointed to the dead drive.

This was a gas futures trading system, and rebuild wasn’t an option. Restoring from backup would have lost four hours of trades, which would be an extreme last resort.

A coworker and I spent all night on the box. We had a whiteboard covered with every stage of the boot sequence broken down, and every redirection we needed to (a) boot and (b) repair the system. The issue started mid-afternoon, and we finally got it back up by around 6:30 am.

T4V0@lemmy.world · 8 months ago

Not a Linux problem per se, but I had a 128GB disk image in an unknown .bin format which belonged to a proprietary application. The application only ran on Windows.

I tried a few things but nothing except Windows-based programs seemed able to identify the partitions, and while I could run the application in Wine, it hit unimplemented functions. So after a bit of googling and probing the file, it turned out the format just had a 512-byte header which some Windows-based software ignored. After including the single block offset, all the tools used in Linux started working flawlessly.

Hadriscus@lemm.ee · 8 months ago

This is so arcane to me. Like, I more or less understand your high-level explanation, but then you gloss over “including the block offset” but how would one do that ??

DickFiasco@lemm.ee · 8 months ago

Inspecting the file with a hex editor would give you lots of useful info in this case. If you know approximately what the data should look like, you can just see where the garbage (header) ends and the data starts. I’ve reverse engineered data files from an oscilloscope like this.

T4V0@lemmy.world · edit-2 8 months ago

Well, in this scenario the image file was made up of 512-byte sections, each one called a block. If you have a KiB (a kibibyte = 1024 bytes) it will occupy 2 blocks and so on…

Since this image file had a 512-byte header (i.e. one block), I could, in any of the relevant Linux mounting tools (e.g. mount, losetup), specify an offset that adds the header size to the partition’s starting position, both in bytes. The command would look like this:

sudo mount -o loop,offset=$((header+partition)) img_file /mnt
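For a concrete number, the offset arithmetic in that command can be sketched as below. The function name image_offset and the example start sector of 2048 are assumptions for illustration; the real start sector would come from a partition listing of the image once the header has been skipped.

```shell
#!/bin/sh
# image_offset HEADER_BYTES START_SECTOR [SECTOR_SIZE]
# Print the byte offset of a partition inside an image that carries an
# extra header in front of the disk data. SECTOR_SIZE defaults to 512.
image_offset() {
    header=$1
    start=$2
    sector=${3:-512}
    echo $((header + start * sector))
}

# Usage (commented out; the start sector 2048 is only an example):
#   sudo mount -o loop,offset="$(image_offset 512 2048)" img_file /mnt
```

With a 512-byte header and a partition starting at sector 2048, the offset comes to 512 + 2048 × 512 = 1049088 bytes.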

33550336@lemmy.world · 8 months ago

quit vim