Fedora Data Engineering SIG: interested in a Fedora SIG to work on this?

Hi all,

I'm wondering whether people who work on data engineering would be
interested in working on a SIG focusing on DE.

My current idea for this SIG would be:
1 - packaging data engineering software into Fedora and making it
easy to install, covering workflow tools (eg: airflow, luigi), data
processing engines (eg: apache spark, flink) and visualization tools
(eg: superset, redash). I'm not sure how well these tools fit the
Fedora packaging guidelines (lots of bundled jars, and users expect
upstream binaries, especially for engines such as spark/flink), which
is something to brainstorm on.
2 - ambassador-related activities promoting Fedora as a platform for
data engineers to use.

If interested, I'm on this Telegram group:

Re: Fedora Data Engineering SIG: interested in a Fedora SIG to w

By Gerald Henriksen at 07/04/2019 - 11:36

I think this is likely a great idea, though I would advise serious
consideration before going down the path of packaging anything Java
related, as you already indicate.

As you note, the users of Java software don't want packaged versions,
and when you combine that with the serious time commitment needed for
not just the initial packaging but also the long term maintenance, you
soon risk getting what Fedora has already seen, as documented on this
list over the last 6 months or so: packages being abandoned.

My reluctant policy these days is to use whatever the language
communities have set up to install anything beyond the basics, whether
it be Pip or Maven or whatever, as that just seems to be the way those
communities want things to work.

Thus I think a far better goal might be:

1) package only stuff that makes sense - ie. anything written in a
language that doesn't have its own package management system, such as
C based programs / libraries.

2) test - make sure that even when using Pip or others to install,
that things just work on Fedora so that anyone using or trying Fedora
gets a good experience.

3) document and promote, so that Fedora looks like a valid alternative
to Ubuntu, which so many of these external software developers
default to. Politely try to get Fedora added as an additional mention
in any 3rd party documentation that assumes Ubuntu or any other Linux
distribution.

Re: Fedora Data Engineering SIG: interested in a Fedora SIG to w

By Ankur Sinha at 07/04/2019 - 06:56

On Thu, Jul 04, 2019 12:41:26 +0800, Mohd Izhar Firdaus Ismail wrote:

Long time no see!!

While I'm not directly in data engineering/science, this is now a very
very important component of the general scientific pipeline. So, I'd be
interested to help out, and I'm sure the rest of the NeuroFedora team
would be too. It'll probably also be beneficial to rope in the sci-tech
folks, and any other folks that were part of the data science/machine
learning SIGs (not sure if the SIGs are active). I also expect the
Astronomy SIG would be interested in some tools, since they also deal
with rather large amounts of data. If we can bring more of these SIGs
together to help each other, we'll get a lot more work done :)

We're meeting at Flock to discuss how we can leverage the excellent
resources that the Fedora community provides to enable Free Science.
Please do join the discussion too:

(I should probably send out a separate e-mail about this).

Re: Fedora Data Engineering SIG: interested in a Fedora SIG to w

By Gerald Henriksen at 07/04/2019 - 19:35

Perhaps then the best solution is to document it, and then the SIG, if
created, can do blog posts, videos, etc. to promote Fedora using a
documented procedure that works?

Re: Fedora Data Engineering SIG: interested in a Fedora SIG to w

By Izhar Firdaus at 07/04/2019 - 23:23

I think documentation alone is not enough to make things easy, since
making the software integrate better with Fedora would require some
additional work (eg: systemd integration, quick painless installation,
prebuilt binaries).

What about docker images, or vendor-rpm, or an automated install script
approach? Would that work?

Re: Fedora Data Engineering SIG: interested in a Fedora SIG to w

By Gerald Henriksen at 07/08/2019 - 11:23

Not if it goes against what the users expect, as it doesn't matter if
your solution is superior if it is too different from what the
community expects / is used to.

Speaking in generalities.

My position has evolved, and I have now taken the position that if a
language (like Python) has a built in infrastructure for package
installation I no longer install any Fedora packages beyond the basics
(ie the compiler/interpreter).

Whether it is good or bad, it is no longer worth fighting those
communities and instead I follow their "best practices" and use their
package systems.

You obviously can decide otherwise.

But based on the above, my advice is to see how the communities
operate and find out how best to make Fedora work for those
communities.

For example, for anything that uses the JVM, it is likely that the
only thing installed from Fedora will be OpenJDK - the communities
built around Java will not use distribution packaged versions of the
software, preferring to install via direct downloads or Maven.

Similarly with Python, every blog post, video, or book states to do
"pip install ..." and it doesn't matter if an RPM is better integrated
into Fedora as few will go against the community.

Obviously there are exceptions, like anything written in C where they
don't (yet) have their own packaging system and so that stuff likely
should be packaged.

Which comes back to my original post suggesting documentation, as it
isn't so much about making things easy as just making the vast
majority of potential users aware that there are Linux alternatives
other than, say, Ubuntu, which seems to dominate the existing mindset
of blog posts and other documentation.

Re: Fedora Data Engineering SIG: interested in a Fedora SIG to w

By Izhar Firdaus at 07/14/2019 - 09:48

Hmm ..

I think I'll do something in between: a set of documentation on how to
install and run the software, and a set of rpms which contain Fedora
integration (systemd service, profile.d files, wrapper scripts, etc) to
make things work better with Fedora if the user follows the convention
provided by the docs. I think this is somewhat similar to the approach
of some existing FOSS game engine packages which require a separate,
manual download of data files / proprietary components.

Thanks for the feedback!
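
As a rough illustration of the "integration rpm" idea above, here is a
minimal Python sketch that renders a systemd unit for software installed
the upstream way (eg: via pip into a virtualenv). Nothing here comes from
an actual package: the render_unit helper, the /opt/airflow/venv path and
the "airflow" user are assumptions made up for the example. Only the
unit file keys themselves are standard systemd directives.

```python
from string import Template

# Hypothetical unit template an integration rpm might ship or generate.
# The directives used (Description, After, Type, User, ExecStart,
# Restart, WantedBy) are standard systemd unit keys.
UNIT_TEMPLATE = Template("""\
[Unit]
Description=$description
After=network.target

[Service]
Type=simple
User=$user
ExecStart=$exec_start
Restart=on-failure

[Install]
WantedBy=multi-user.target
""")


def render_unit(description: str, user: str, exec_start: str) -> str:
    """Return the text of a systemd service unit for an upstream install."""
    return UNIT_TEMPLATE.substitute(
        description=description,
        user=user,
        exec_start=exec_start,
    )


if __name__ == "__main__":
    # Assumed convention from the docs: Airflow is pip-installed into
    # a virtualenv at /opt/airflow/venv (hypothetical path).
    print(render_unit(
        description="Airflow scheduler (upstream pip install)",
        user="airflow",
        exec_start="/opt/airflow/venv/bin/airflow scheduler",
    ))
```

The point of the sketch is the split of responsibilities: the user
installs the software following the documented convention, and the rpm
only carries the small Fedora-specific glue that points at it.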